AN INTRODUCTION TO AUDIO
AND VISUAL RESEARCH AND
APPLICATIONS IN MARKETING
Li Xiao, Hye-jin Kim and Min Ding
ABSTRACT
Purpose – The advancement of multimedia technology has spurred the use of multimedia in business practice. The adoption of audio and visual data will accelerate as marketing scholars become more aware of the value of audio and visual data and the technologies required to reveal insights into marketing problems. This chapter aims to introduce marketing scholars into this field of research.
Design/methodology/approach – This chapter reviews the current technology in audio and visual data analysis and discusses rewarding research opportunities in marketing using these data.
Findings – Compared with traditional data like survey and scanner data, audio and visual data provide richer information and are easier to collect. Given this superiority, together with data availability, feasibility of storage, and increasing computational power, we believe that these data will contribute to better marketing practices with the help of marketing scholars in the near future.
Review of Marketing Research, Volume 10, 213–253
Copyright © 2013 by Emerald Group Publishing Limited
All rights of reproduction in any form reserved
ISSN: 1548-6435/doi:10.1108/S1548-6435(2013)0000010012
© Malhotra, Naresh K., Jun 24, 2013, Review of Marketing Research, Emerald Group Publishing Limited, Bradford, ISBN: 9781781907610
Practical implications – The adoption of audio and visual data in marketing practices will help practitioners to gain better insights into marketing problems and thus make better decisions.
Value/originality – This chapter makes the first attempt in the marketing literature to review the current technology in audio and visual data analysis and proposes promising applications of such technology. We hope it will inspire scholars to utilize audio and visual data in marketing research.
Keywords: Audio data; video data; computer technology; machine learning; marketing
INTRODUCTION
Audio data and visual data, both static (i.e., photo) and dynamic (i.e., video), are now widely available.1 Due to technological advancements, we are now not only able to store vast amounts of audio and visual data on computers, but we have also acquired the computational power and know-how to glean insights from such data (Burke, 2005).
Having begun with objectives like voice recognition and face recognition, technology development has yielded applications with great utility. While technology development associated with audio data does not fall within a given established domain, technology development associated with visual data falls within a well-defined field called computer vision, which is concerned with electronically acquiring, processing, reconstructing, analyzing, interpreting, and understanding images (Shapiro & Stockman, 2001; Szeliski, 2011).2 A wide variety of real-world applications incorporate computer vision technology, in areas such as artificial intelligence, robotics, biometrics, and marketing.
Traditionally, the domain of technological inquiry has belonged to computer scientists and engineers. Although still emergent, the adoption of audio and visual technology is becoming more common in the marketing discipline. This trend will only accelerate over the coming years as marketing scholars become more aware of the value of audio and visual data and the technologies required to reveal insights into marketing problems.
In this chapter, we endeavor to provide an introduction to the work in audio and visual research, including data, research methods, and examples of
applications in practice. We then review the extant academic literature in marketing and other business-related disciplines. We conclude by discussing a few promising research directions using audio or visual data, including a new field of research that we call artificial empathy (AE).
DATA
In this section, we provide a brief discussion of the major types of audio and visual data available. The list is not exhaustive, but we hope it will convey three key points: (a) audio and visual data are abundantly available, (b) they are rich in information, and (c) though large in size, such data can now be stored on regular personal computers (Table 1).
Audio Data
We restrict our chapter to a specific type of audio data – human voice. Voice is essential in human communication, not only because it expresses meaning, but also because it conveys a person's inner state, such as various emotions, uncertainty, and personality. It also plays a big role in how people perceive and judge other people's traits and emotional states. According to a study that was quoted in the Wall Street Journal, "when we are deciding whether we like the person delivering a message, tone of voice accounts for 38% of our opinion, body language for 55% and the actual words for just 7%" (Wall Street Journal, 2007).
We can classify audio data into at least three types: (a) those produced and recorded as products which will be "consumed" by the target listeners, (b) audio recordings of daily human communication, and (c) voice recordings specifically made for the purpose of analysis in order to elicit information. Based on traditional data classification in marketing, the first and second types are secondary data, and the third type is primary data.
Some widely available audio data of the first type includes recorded songs and all broadcasted radio content. The second type includes recorded conversations. Given technological advancements, conversations can take place between humans (e.g., communication between a customer and a representative at a call center), as well as between humans and machines (software). Smartphones have integrated voice commands extensively and many voice-based applications have been developed, such as iPhone's "Siri." The third type includes conversations specifically designed in order
Table 1. Key Terminology.

Audio
- fundamental frequency: A periodic soundwave consists of multiple frequency components, a.k.a. partials. The frequencies of most of the partials are related to the frequency of the lowest partial by an integer ratio. The frequency of this lowest partial is the fundamental frequency of the waveform.
- prosodic features: Variations in the parameters of frequency, amplitude, and speed of utterance (auditory pitch, loudness, and tempo), which permanently characterize speech (Crystal, 1966).
- speech recognition: The conversion of spoken language into text, especially by a machine.
- emotional speech recognition or speech emotion recognition: The detection of emotions of speakers from speech, especially by a machine.

Visual
- digital image: A digital image is a two-dimensional digital depiction of a visual perception, in which the spatial coordinates and intensity value of each element are all finite, discrete quantities.
- pixel: Pixels are the smallest elements that comprise a digital image. Each pixel has particular spatial coordinates and an intensity value, where the spatial coordinates define location in the image and the intensity value defines brightness (for color images) or gray level (for grayscale images).
- RGB color model: A color digital image is usually represented as an RGB (red, green, blue) model. An image represented in the RGB color model consists of three component images, one for each primary color. It is stored on a computer as a 3-D array.
- computer vision: Computer vision is a field concerned with electronically acquiring, processing, reconstructing, analyzing, interpreting, and understanding images.
- object recognition: Object recognition refers to the task of finding a given object in an image, based on the segmentation results. In the end, a label (e.g., "human face") can be assigned to an object based on its features.
- object tracking: Object tracking refers to the task of automatically identifying and following image elements of interest as they move across a video sequence.
to collect information from voice, such as in a voice-based authentication system, where an audio recording is the explicit objective, not the byproduct, of the exchange.
In the machine learning literature, because the objective is to better train the classifier and improve prediction, various databases, called speech corpora, contain speech data mapped to corresponding emotion labels. This enables researchers to train classifiers and then test and make predictions. Note that in addition to using existing corpora, researchers can create speech corpora based on their own needs, as long as the speech data are mapped to corresponding emotion labels. Existing speech corpora have been recorded in a wide range of environments, from several words spoken in a quiet condition, to multiple sentences spoken in a noisier condition, and real instances of telephone recordings. Results show that regardless of the situation and conditions, classifiers significantly underperform humans.
Three types of speech have been observed in the literature. First, natural, or spontaneous, speech is collected during real situations. Difficulty arises when labeling natural speech data, because the actual emotion or attitude that a speaker is experiencing is unknown. In this case, third-party labelers can classify the data depending on how they perceive the emotions or attitudes in the speech recordings, especially if the objective is to train a machine learning classifier according to how humans perceive speech. Second, simulated or acted speech is recorded from professional actors. Actors can provide speech that conveys "clean" emotions with high arousal, so it is a reliable type of data. Lastly, elicited or induced speech is obtained when scenarios are given to subjects to induce certain types of emotions (Ververidis & Kotropoulos, 2006).
Whether existing or newly collected, digital audio data are typically created by "sampling" information from an original analog signal. The quality of the sampled sound object depends on factors such as sample size and sampling rate. The sample size (or bit depth) is the number of bits (typically 8, 16, or 24) representing each audio signal sample. With more bits in each sample, the digitized audio can represent the original analog signal more accurately. The sampling rate is the number of samples per second extracted from an audio signal to obtain a digital audio file. Commonly employed sampling rates are 11,025 Hz, 22,050 Hz, and 44,100 Hz, with higher sampling rates corresponding to better sound quality. CD-quality sound is typically obtained by taking 44,100 16-bit samples per second on each of two stereo channels, which means one second of CD-quality audio requires about 1.4 million bits (roughly 176 kilobytes of data). Thus, a 1-terabyte hard drive can store over 1,500 hours of CD-quality audio.3
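The storage arithmetic above can be checked with a short sketch (the function name and the two-channel stereo assumption are ours, not from the chapter):

```python
def audio_bytes(sample_rate_hz, bit_depth, channels, seconds):
    """Size in bytes of uncompressed PCM audio."""
    return sample_rate_hz * bit_depth * channels * seconds // 8

# One second of CD-quality audio: 44,100 samples/s x 16 bits x 2 channels.
one_second = audio_bytes(44_100, 16, 2, 1)          # 176,400 bytes
# Hours of CD-quality audio that fit on a 1 TB (10**12 byte) drive.
hours = 10**12 / audio_bytes(44_100, 16, 2, 3600)   # roughly 1,575 hours
```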
Visual Data
We can broadly classify visual data into static (i.e., images) and dynamic (i.e., videos). They are closely associated with each other in our analysis, since video is typically analyzed at the frame level, as static images. Similarly, visual data can also be classified into three types: (a) data specifically created for consumption by other people, (b) visual recordings of life, and (c) data recorded for the specific purpose of capturing information. The first two types are secondary data and the third type is primary data.
The first type of visual data includes, but is not limited to, photographs, print advertisements, video games, movies, television/online video commercials, and other television content. The most recent trend is making user-generated videos available online. User-generated videos have been a great source of public videos since the emergence of online sharing web sites, for example, YouTube. On YouTube, 48 hours of video are uploaded every minute, and most of the content is uploaded by individuals (Richmond, 2012).
The second type of visual data is sometimes called surveillance video, although the word "surveillance" may convey unnecessarily negative connotations. We define the second type as data collected to be an objective recording of what happened in real life. Although the most abundant data within this type comes from surveillance cameras, it can also include other candid recordings of life, including many home videos, or videos by someone who turns on a webcam to record his/her activities in the office. Videos recorded by anthropologists also fall into this category.
The advent of digital technology spurred the use of video surveillance in the United States beginning in the 1980s (Roberts, 2012). Compared with previously used tapes, digital technology provides a faster, clearer, more efficient, and cheaper solution to video surveillance. Surveillance videos have been widely used for security purposes. Businesses prone to theft (e.g., banks, retail stores, gas stations) install video surveillance systems in high-crime areas in order to record activities and/or transactions, and also to deter thieves. Nowadays, ATMs across the United States and in most parts of the world have video cameras installed in them to record all transactions. In many cities, cameras have now been installed outside stores and on the streets (Levy & Weitz, 2008). In the United Kingdom, cameras have been installed in some taxis in an attempt to reduce violent attacks on drivers (BBC News, 2003).
The third type includes data specifically collected to capture information that is only available visually. Some examples of applications using static
images include authentication systems based on face, fingerprint, or iris recognition; optical character recognition systems used by the post office to scan and recognize handwritten zip codes; and magnetic resonance imaging (MRI) and x-ray images for disease diagnosis. Dynamic videos are used in applications for autonomous driving navigation, and eye tracking for advertisement copy testing, among others.
An image is a two-dimensional depiction or recording of a visual perception. It may be defined as a two-dimensional function f(x, y), where (x, y) denotes spatial coordinates, and the value of f at any point (x, y), which is called intensity, is proportional to the brightness (for color images) or gray level (for grayscale images) of the image. When the spatial coordinates (x, y) and intensity values f are all finite, discrete quantities, the image f(x, y) is called a digital image. A digital image is composed of a finite number of elements, each of which has a particular location (x, y) and intensity value f(x, y). These elements are called pixels.
A digital image could be black and white, grayscale, or color. A black and white image is stored on a computer as a 2-D array (denoted as X × Y, with spatial coordinates 1 ≤ x ≤ X and 1 ≤ y ≤ Y), and f(x, y), the intensity value of point (x, y), is a binary variable, with 0 referring to black and 1 referring to white. A grayscale image is stored on a computer as a 2-D array as well. For an 8-bit grayscale image, the intensity value of point (x, y) is an integer between 0 and 255, with 0 referring to black and 255 referring to white. A color digital image is usually represented as an RGB (red, green, blue) model. An image represented in the RGB color model consists of three component images, one for each primary color. It is stored on a computer as a 3-D array (denoted as X × Y × 3; see Fig. 1). The intensity value of point (x, y) is a triplet of values (R, G, B). For a full color 24-bit RGB color image, the values (R, G, B) are integers between 0 and 255. For example, a white pixel is represented as (255, 255, 255), a black pixel as (0, 0, 0), a red pixel as (255, 0, 0), and a magenta pixel as (255, 0, 255). Color images can also be represented in other color models, such as CMY (cyan, magenta, yellow), CMYK (cyan, magenta, yellow, black), and HSI (hue, saturation, intensity).
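To make the array representation concrete, here is a minimal sketch of a 2 × 2 RGB image stored as an X × Y × 3 structure (plain Python lists stand in for a proper array type; the example pixels are ours):

```python
# A 2 x 2 full-color image: each pixel is an (R, G, B) triplet in 0..255.
image = [
    [(255, 255, 255), (255, 0, 0)],   # row 1: white, red
    [(0, 0, 0), (255, 0, 255)],       # row 2: black, magenta
]

# The three component images can be recovered by slicing out one channel.
red_channel = [[pixel[0] for pixel in row] for row in image]
# red_channel == [[255, 255], [0, 255]]
```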
Resolution, a term used to describe the clarity and sharpness of an image, refers to the number of pixels contained in a digital image.4 For example, a 3-megapixel digital image is typically sized at 2,048 × 1,536 pixels. Without compression, a full color 24-bit 3-megapixel image requires close to 9 megabytes of computer storage space (2,048 × 1,536 × 24 bits ≈ 9 megabytes). An image with higher resolution requires more pixels to represent it, thus requiring more computer storage space.
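The storage calculation generalizes to any resolution; a small sketch (the function name is ours):

```python
def image_bytes(width, height, bits_per_pixel=24):
    """Uncompressed storage for a digital image, in bytes."""
    return width * height * bits_per_pixel // 8

# A 3-megapixel, full color 24-bit image.
size = image_bytes(2048, 1536)   # 9,437,184 bytes, i.e., close to 9 MB
```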
Fig. 1. Pixel Representation of a Digital Image. [Figure: a sample digital image, a magnified block of its pixels, and the R, G, and B intensity values of those pixels.]
Video is a temporal sequence of still images representing scenes in motion. Digital video comprises a series of digital images displayed in rapid succession at a constant rate. In the context of video, these images are called frames. The rate at which frames are displayed is measured in frames per second (fps). Normally, a one-second video clip comprises 15, 24, or 30 frames. Low-quality surveillance video is often shot at 15 fps, while higher quality videos such as commercials and movies are often shot at 24 or 30 fps.
The visual quality of a video mainly depends on two things: the resolution of the image on each frame, and the number of frames per second. As resolution and fps increase, so do visual quality and video storage space requirements. Without compression, a 30-second HDTV commercial in 720p, 24 fps format requires approximately 2 gigabytes of storage capacity (1280 × 720 pixels × 24 bits × 24 fps × 30 seconds ≈ 2 gigabytes). Movies are much longer and require even higher resolution than HDTV commercials. Without compression, a 2-hour movie in 1080p and 24 fps format would require up to 1 terabyte of storage capacity. That is a tremendous amount of space, given that hard disk capacity for a regular PC nowadays is around 500 gigabytes. Thus, reducing the amount of storage space required for image/video data without sacrificing much visual quality has become a hot area in digital image processing, which is called image compression.
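The two worked examples above follow from one formula; a sketch (the function name is ours):

```python
def video_bytes(width, height, bits_per_pixel, fps, seconds):
    """Uncompressed storage for digital video, in bytes."""
    return width * height * bits_per_pixel * fps * seconds // 8

# 30-second 720p, 24 fps HDTV commercial: about 2 gigabytes.
commercial = video_bytes(1280, 720, 24, 24, 30)     # 1,990,656,000 bytes
# 2-hour 1080p, 24 fps movie: about 1 terabyte.
movie = video_bytes(1920, 1080, 24, 24, 2 * 3600)   # 1,074,954,240,000 bytes
```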
Image compression is the process of reducing or eliminating redundancy and/or irrelevance in image data in order to store or transmit data in an efficient form (Gonzalez & Woods, 2008). For example, for static images, .jpg and .png formats indicate different levels of compression. After compression, the storage space needed for a 3-megapixel full color image reduces from 9 megabytes to approximately 2.5 megabytes in .png format, or approximately 1 megabyte in .jpg format. For dynamic videos, .wmv and .mp4 are examples of video compression formats. After compression, the storage space needed for a 30-second 720p, 24 fps HDTV commercial reduces from 2 gigabytes to approximately 9 megabytes in .wmv format, and approximately 5 megabytes in .mp4 format. Similarly, a 1080p, 24 fps movie can be stored with approximately 2 gigabytes of disk space after compression.
RESEARCH METHODS
In general, analysis of audio or visual data follows a three-step process.First, raw data is acquired/recorded, preprocessed and cleaned. Second,
relevant features are extracted from the cleaned raw data to prepare it for processing. Third, machine-learning tools are used to obtain useful information based on the extracted features. The first and second steps are unique to audio or visual data, and we will discuss them separately in this section. In the last subsection, we will describe the underlying machine learning tools, which are the same for both types of data, although different algorithms are used in different applications.
Audio Data Processing
A discussion about audio data processing requires a basic understanding of how human speech is produced. Producing human speech requires coordination of three different areas of the body: the lungs, the vocal folds and larynx, and the articulators. The lungs act as a pump to produce airflow and air pressure, which is the fuel of vocal sound. The vocal folds within the larynx vibrate and interrupt the airflow produced from the lungs to produce an audible sound source. The muscles of the larynx can produce different pitches and tones by varying the tension and vibration of the vocal folds. Then, the vocal sound can be emitted through the nasal or oral cavities. At this stage, we can differentiate between the nasal consonants (/m/, /n/, /ŋ/) and other sounds. The articulators, which consist of the tongue, palate, cheeks, lips, etc., form and filter the sound being emitted from the vocal folds and interact with the airflow to strengthen or weaken the sound. For example, the lips can be pressed together to produce the sounds /p/ and /b/, or be brought into contact with the teeth to produce /f/ and /v/. Vowels are produced when the air is allowed to pass relatively freely over different lip and tongue positions. For example, /i/ as in the word "beat" is produced when the tongue is raised and pushed forward, and /a/ as in "bar" is produced when the tongue is lowered and pulled back.
When the voice exits the mouth, it is transmitted in space over time in the form of a soundwave. This soundwave, or signal, can be represented as a function of time, which is the traditional way of observing signals, and is called the time domain. The pitch of the voice is how high or low the voice sounds, as perceived by a human, represented by the frequency of the soundwave. The intensity of the voice is how loud or soft a voice sounds, represented by the amplitude of the soundwave (Fig. 2).
It has been proven by Fourier that any waveform can be represented as a sum of different sine waves. Because some high-frequency sine waves may be important to examine but are unobservable in the time domain,
sometimes it is more useful to look at the soundwave from another perspective, called the frequency domain. The waveform in Fig. 3 can be represented as the sum of two sine waves, one with a high amplitude and low frequency, and one with a low amplitude and high frequency. These two sine waves can be converted to the frequency domain. A given function or signal can be converted between the time and frequency domains with a pair of mathematical operators called a transform. An example is the Fourier transform, which decomposes a function into the sum of a (potentially infinite) number of sine wave frequency components. The time domain and the frequency domain are two different viewpoints for examining the audio signal, and the signal can also be viewed "top-down," in a visual representation used in audio processing software called a spectrogram.
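The decomposition can be illustrated with a naive discrete Fourier transform applied to a composite of two sine waves like the one just described (the signal, bin choices, and all names are our illustrative assumptions):

```python
import cmath
import math

def dft_magnitudes(samples):
    """Naive discrete Fourier transform; magnitude of each frequency bin."""
    n = len(samples)
    return [abs(sum(samples[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t in range(n))) / n
            for k in range(n)]

n = 64
# Time domain: a high-amplitude low-frequency sine (bin 2) plus a
# low-amplitude high-frequency sine (bin 11).
signal = [2.0 * math.sin(2 * math.pi * 2 * t / n)
          + 0.5 * math.sin(2 * math.pi * 11 * t / n) for t in range(n)]

mags = dft_magnitudes(signal)
# Frequency domain: the energy concentrates at bins 2 and 11.
peaks = sorted(range(1, n // 2), key=lambda k: -mags[k])[:2]
```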
Once audio data are recorded, the first step is to preprocess them by converting the analog signal into a digital signal and filtering out the noise. One common approach in the time or frequency domain is to enhance the input signal through a method called filtering. Digital filtering generally refers to a linear transformation of a number of samples surrounding the current input sample or output signal. There are various ways to characterize filters, but that discussion is beyond the scope of this chapter.
Fig. 2. Basic Terms.
Depending on the task at hand, digital signal processing applications can be run on general purpose computers using specialized software, more general software such as MATLAB with an appropriate toolbox (i.e., the digital signal processing toolbox), or with embedded processors that include specialized microprocessors called digital signal processors.
Once the audio recording is cleaned, the next step is to extract voice features. Typically, audio information is first cut into frames for paralinguistic analysis, where the rule of thumb is 10–40 milliseconds per frame, and features are estimated on a frame-by-frame basis, assuming stationarity. The major features to be extracted include pitch, intensity, temporal features, and the slope (first derivative) of pitch and intensity. Numerous algorithms are available to extract these features.
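The framing step can be sketched as follows (the 25 ms frame and 10 ms hop are our illustrative choices within the 10–40 ms rule of thumb; the function name is ours):

```python
def frame_signal(samples, sample_rate, frame_ms=25, hop_ms=10):
    """Cut a sampled signal into overlapping fixed-length frames."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop = int(sample_rate * hop_ms / 1000)
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]

# One second of audio at 8,000 Hz becomes 98 frames of 200 samples each.
frames = frame_signal(list(range(8000)), 8000)
```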
Pitch is determined by the fundamental frequency. A periodic soundwave consists of multiple frequency components, which are known as partials. Most partials are harmonically related, and the frequencies of most of the partials are related to the frequency of the lowest partial by an integer ratio.
Fig. 3. Time Domain and Frequency Domain.
The frequency of this lowest partial is defined as the fundamental frequency of the waveform. Pitch extraction algorithms operate in either the time domain or the frequency domain.
One example of calculating fundamental frequency in the time domain is based on autocorrelation. For a time signal x(t), the autocorrelation is:

r_x(τ) = ∫ x(t) x(t + τ) dt,  for lag τ    (1)

The fundamental frequency is defined as F0 = 1/T0, where T0 is the lag at which the autocorrelation has a global maximum. Intensity is often calculated as the root mean square (RMS) of the amplitude of the soundwave:

RMS = √( (1 / (t₂ − t₁)) ∫_{t₁}^{t₂} x²(t) dt ),  for the time interval (t₁, t₂)    (2)
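Discrete analogues of Eqs. (1) and (2) can be sketched as follows (the sine-wave test signal, the minimum-lag guard, and all names are our illustrative assumptions, not from the chapter):

```python
import math

def autocorrelation(x, lag):
    """Discrete analogue of Eq. (1): sum of x[t] * x[t + lag]."""
    return sum(x[t] * x[t + lag] for t in range(len(x) - lag))

def fundamental_frequency(x, sample_rate, min_lag=20):
    """F0 = sample_rate / T0, where lag T0 maximizes the autocorrelation."""
    t0 = max(range(min_lag, len(x) // 2),
             key=lambda lag: autocorrelation(x, lag))
    return sample_rate / t0

def rms_intensity(x):
    """Discrete analogue of Eq. (2): root mean square of the amplitude."""
    return math.sqrt(sum(v * v for v in x) / len(x))

# A 100 Hz sine sampled at 8,000 Hz for 0.2 seconds.
sr = 8000
x = [math.sin(2 * math.pi * 100 * t / sr) for t in range(1600)]
f0 = fundamental_frequency(x, sr)   # close to 100 Hz
rms = rms_intensity(x)              # close to 0.707
```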
Temporal features include, but are not limited to, total duration, speaking duration, silence duration, percent of silenced regions, and the speaking rate (number of syllables/speaking duration).
For feature extraction purposes, there is a freely available software application, Praat (the Dutch word for "talk"), which was developed as a phonetic speech analysis tool by researchers at the Institute of Phonetic Sciences at the University of Amsterdam. It has a point-and-click interface and can be run using scripts. Researchers can also use MATLAB by writing their own code, or use several toolboxes developed by other researchers which provide routines to extract various features (VOICEBOX, COLEA, VoiceSauce, etc.).
Visual Data Processing
As described previously, digital images are stored as arrays of discrete data on a computer. When processing a digital image, the computer is actually processing a data array that represents the image. How to extract useful information that is interpretable by humans from such data arrays is one primary objective of computer vision (Szeliski, 2011). Visual data processing systems are highly application dependent (Gonzalez & Woods, 2008; Shapiro & Stockman, 2001). A typical visual data processing system may contain five fundamental steps: image acquisition, preprocessing, segmentation, feature extraction, and high level processing.
Image acquisition refers to the creation of a digital image. A digital image could be acquired by a camera, a radar, a tomography device, or an electron microscope, by extracting a frame from video, or simply by receiving an image that is already in digital form.
The second step of image processing is preprocessing the visual data, which is actually the first step of data analysis, as it is for audio data. The objective of image preprocessing is to improve the image data quality for later analysis. Image preprocessing may involve image conversion (e.g., from RGB to grayscale), noise reduction (e.g., illumination compensation), normalization, etc. As with audio data, a Fourier transform can be applied to a 2-D image, transforming it from the spatial domain to the frequency domain. With carefully chosen parameters, frequency domain filters can be very efficient in image enhancement and noise reduction (Gonzalez & Woods, 2008).
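The RGB-to-grayscale conversion mentioned above can be sketched with the widely used ITU-R BT.601 luma weights (our choice of weighting; other weightings exist, and the function name is ours):

```python
def to_grayscale(rgb_image):
    """Convert an RGB image (rows of (R, G, B) triplets) to grayscale."""
    return [[round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
            for row in rgb_image]

image = [[(255, 0, 0), (255, 255, 255)],
         [(0, 0, 0), (0, 0, 255)]]
gray = to_grayscale(image)   # [[76, 255], [0, 29]]
```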
Segmentation is the process of partitioning a digital image into constituent regions or segments (sets of pixels), where pixels in the same segment share certain visual characteristics. The goal of segmentation is to simplify and/or change the representation of an image into something that is more meaningful and easier for a computer to analyze (Shapiro & Stockman, 2001). Segmentation accuracy usually determines the eventual success or failure of high level processing, for example, an object recognition task. Generally speaking, the more accurate the segmentation, the more likely recognition is to succeed (Gonzalez & Woods, 2008). The Fourier transform can also be used for segmentation purposes. Some low level image processing (in which both inputs and outputs are images) may finish with segmentation, for example when turning 2-D pictures into a 3-D model.
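The simplest form of segmentation is global thresholding on a grayscale image, sketched below (the threshold of 128 and the toy image are our illustrative choices):

```python
def threshold_segment(gray, threshold=128):
    """Partition pixels into foreground (1) and background (0) segments."""
    return [[1 if value >= threshold else 0 for value in row] for row in gray]

gray = [[10, 200, 220],
        [15, 180, 30]]
segments = threshold_segment(gray)   # [[0, 1, 1], [0, 1, 0]]
```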
When a high level of image processing is pursued (in which the input is an image, but the output is understanding/interpretation), the next step is feature extraction. The output of the segmentation stage is usually raw pixel values, which must be converted into a form suitable for a computer to process. Feature extraction deals with extracting and selecting attributes that lead to some quantitative information of interest. The features to be extracted depend on the problem being solved. Examples of such features include eigenface features in a face recognition task (Turk & Pentland, 1990) and SIFT features in a symmetry detection task (Loy & Eklundh, 2006).
Once appropriate features are extracted, the final step is high level processing. The objective of this step is to reveal information or understanding that is interpretable by humans. Here, we limit our discussion to two prevalent high level processing tasks, namely object recognition and object tracking. Object recognition refers to the task of finding a given
object in an image, based on the segmentation results. In the end, a label (e.g., "human face") can be assigned to an object based on its features. Object tracking is "following image elements moving across a video sequence automatically" (Trucco & Plakas, 2006, p. 520). The general idea of object tracking is to recognize one or multiple objects of interest in each frame of the video and then link the recognized objects from one frame to another so as to estimate their trajectory in the image plane as they move around a scene. Object recognition in each frame can be done independently, or more often dependently, because the focal object appears with high probability in the area adjacent to its location in the previous frame. It is worth noting that some applications can use either object recognition or object tracking. For example, facial expression can be recognized either by classifying facial expressions on each single image or by tracking facial points of interest across a temporal sequence of images to identify the facial expression (Fasel & Luettin, 2003).
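The frame-to-frame linking idea can be sketched with a nearest-neighbor tracker (a deliberately minimal illustration of our own; real trackers add motion models and appearance features):

```python
def link_track(detections_per_frame):
    """Follow the object nearest to its previous position in each frame.

    detections_per_frame: one list of (x, y) detections per video frame.
    Returns the estimated trajectory, starting from the first detection
    of the first frame.
    """
    track = [detections_per_frame[0][0]]
    for detections in detections_per_frame[1:]:
        px, py = track[-1]
        track.append(min(detections,
                         key=lambda d: (d[0] - px) ** 2 + (d[1] - py) ** 2))
    return track

# Two objects detected in each of three frames; we follow the first one.
frames = [[(10, 10), (50, 50)],
          [(12, 11), (49, 52)],
          [(15, 13), (47, 55)]]
trajectory = link_track(frames)   # [(10, 10), (12, 11), (15, 13)]
```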
MATLAB is a popular software application used for image processing and computer vision tasks. Its Image Processing Toolbox is developed specifically for digital image processing purposes. This toolbox provides a large set of algorithms and tools for image processing, analysis, visualization, and algorithm development (www.mathworks.com). Another powerful tool is OpenCV, an open source computer vision library originally launched by Intel. It contains more than 2,500 algorithms, and the number keeps growing (www.opencv.org). It is compatible with various operating systems, including Windows, Linux, Mac OS X, Android, and iOS, with interfaces for C, C++, Python, and Java.
Machine Learning
Machine learning is a field that includes methods and algorithms that allow computers to learn from data (Bishop, 2006). Samuel (1959) defined machine learning loosely as a "field of study that gives computers the ability to learn without being explicitly programmed." Mitchell (1997) proposed a more formal and technical definition: "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E" (Mitchell, 1997, p. 2). According to Mitchell (2006), "Machine Learning is a natural outgrowth of the intersection of Computer Science and Statistics" (Mitchell, 2006, p. 1). Although machine learning overlaps significantly with statistics, it is still a distinct discipline.
An Introduction to Audio and Visual Research and Applications in Marketing 227
"Whereas Statistics has focused primarily on what conclusions can be inferred from data, Machine Learning incorporates additional questions about what computational architectures and algorithms can be used to most effectively capture, store, index, retrieve and merge these data, how multiple learning subtasks can be orchestrated in a larger system, and questions of computational tractability" (Mitchell, 2006, p. 1). Datasets used for machine learning are usually nonstationary, dependent, and large in size, whereas standard statistical techniques typically assume clean data sampled independently from the same distribution, making them inefficient when handling very large datasets (Hand, 1998). Now that computational power has increased, analysis of audio and visual data is typically accomplished using various algorithms based on machine learning. Machine learning is uniquely suited to these tasks because it can handle extremely large datasets.
Machine learning is a vast field. In general, there are three types of machine learning: supervised learning, unsupervised learning, and reinforcement learning.5 Supervised learning is the task of inferring a model from labeled training data. The training data comprises two parts: the input variables (X's) and their corresponding target variables (y's). When the target variable is categorical, the supervised learning task is called classification; when the target variable is continuous, the task is called regression. Here we focus on classification, since regression is a familiar concept in the marketing field. In the classification task, the model inferred from the training data is called a classifier. Once a classifier is trained with training data, it is applied to test data (a dataset different from the training data) to infer or predict labels. Supervised learning is widely used for recognition tasks, such as speech recognition (Klevans & Rodman, 1997), speech emotion recognition (Lee & Narayanan, 2005), face recognition (Turk & Pentland, 1991; Zhao, Chellappa, Phillips, & Rosenfeld, 2003), facial expression recognition (Fasel & Luettin, 2003), handwritten digit recognition (Liu, Nakashima, Sako, & Fujisawa, 2003), and tumor detection (Sajda, 2006), among others.6 For example, in a handwritten digit recognition task, the training data is a set of scanned images of handwritten digits, where the input variables (X's) are features extracted from the training images, and the target variables (y's) are the true digits corresponding to the training images, namely 0, 1, ..., 9. Ideally, the size of the training dataset should be large enough to include sufficient representative examples. The classifier is trained to infer y's from X's based on the training data. Given a new set of images (the test data), the same set of features is extracted from the test images as input variables, and
the inferred classifier is used to assign labels to the test images, essentially classifying each test image into one of the ten classes labeled 0, 1, ..., 9. Similarly, in speech keyword recognition tasks, input variables such as pitch, intensity, and temporal features are extracted from the training data; the classifier is trained with the target keywords and then tested on a new speech dataset to check its performance.
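As an illustration of this train-then-classify workflow, the sketch below uses the small handwritten-digit dataset bundled with scikit-learn (an assumption on our part; any supervised-learning toolkit would do). Pixel values serve as the input variables X, the true digits as the targets y, and a support vector machine is trained on one split and evaluated on a held-out test split.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

digits = load_digits()                 # 8x8 grayscale images of digits 0-9
X, y = digits.data, digits.target      # X: pixel features, y: true digit labels

# Hold out part of the data as the test set (a different dataset from training).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

clf = SVC(kernel="rbf", gamma=0.001)   # the classifier to be trained
clf.fit(X_train, y_train)              # infer the classifier from labeled data

accuracy = clf.score(X_test, y_test)   # fraction of test images labeled correctly
```

On this easy benchmark the held-out accuracy is typically well above 90%; real audio or visual recognition tasks require far more careful feature extraction.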
There is a large set of classifiers that could be used for supervised learning (Bishop, 2006; Mitchell, 1997). It is difficult to provide a complete list, so here we only provide a few examples for introductory purposes. Classifiers such as linear discriminant classifiers (LDCs), logistic regression, and k-nearest neighbor (kNN) classifiers are simple and have been used for a long time (Lee & Narayanan, 2005; Lehmann et al., 2005; Litman & Forbes, 2003; Shami & Verhelst, 2007). Decision trees (e.g., CART, classification and regression trees) are another simple and widely used method, because the result of a decision tree classification is intuitive and easy to interpret (Murthy, 1998). However, these algorithms suffer from the "curse of dimensionality": as the number of feature dimensions grows, the available data becomes sparse. Also, kNN is sensitive to outliers. Support vector machines (SVMs) are a natural extension of LDCs, with the goal of finding the widest gap that divides the data points into different classes. SVMs provide good generalization properties and are adopted very frequently in the literature (Ganapathiraju, Hamaker, & Picone, 2004; Morrison, Wang, & De Silva, 2007; Yu, Chang, Xu, & Shum, 2001). Hidden Markov models (HMMs), which model the temporal evolution of a signal with an underlying Markov process, are also extensively used in speech analytics, because they are a natural representation of speech in the time domain (Juang & Rabiner, 1991; Nwe, Foo, & De Silva, 2003). Neural networks are another effective and widely used classification method (Zhang, 2000); they develop hierarchically structured classifiers capable of handling more complex situations. For example, Pomerleau (1993) used a neural network to train a robotic vehicle to drive autonomously.
The neural network system first analyzes the digital image acquired by an installed camera and estimates the type of road the robotic vehicle is driving on (e.g., single lane, multilane, dirt), and then navigates based on the estimated road type. These models also have simple algorithmic structures, making them straightforward to implement with fairly good performance (Nakatsu, Nicholson, & Tosa, 2000; Schuller et al., 2009d). Deciding which classification method to use depends on various factors, such as: (a) how to treat high dimensionality in the feature space, (b) applicability to small datasets, and (c) how to handle skewed classes (Schuller, Anton, Steidl, & Seppi, 2011).
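To make the simplest of these classifiers concrete, here is a minimal k-nearest-neighbor classifier written from scratch (an illustrative sketch, not a production implementation): each test point receives the majority label among its k closest training points. The distance computation also makes visible why kNN is sensitive to outliers and to distance distortion in high-dimensional feature spaces.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Label x by majority vote among its k nearest training points."""
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distances
    nearest = np.argsort(dists)[:k]               # indices of the k closest points
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Two well-separated synthetic clusters with labels 0 and 1.
rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(0, 0.5, (20, 2)),
                     rng.normal(5, 0.5, (20, 2))])
y_train = np.array([0] * 20 + [1] * 20)

label = knn_predict(X_train, y_train, np.array([4.8, 5.1]), k=3)
```

A point near the second cluster is labeled 1; a single mislabeled training point close to the query could flip that vote, which is the outlier sensitivity noted above.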
Unsupervised learning is the task of inferring a model from unlabeled training data. In unsupervised learning, the training data contains only input variables (X's), not corresponding target variables (y's). Unsupervised learning is often used for segmentation tasks, such as speaker segregation and identification (Gish, Siu, & Rohlicek, 1991), image segmentation (Zhang, Fritts, & Goldman, 2008), and graph partitioning (Shi & Malik, 2000). Some scholars propose to apply unsupervised learning to object class recognition tasks as well (e.g., Fergus, Perona, & Zisserman, 2003).
There is also a large set of algorithms that can be used for unsupervised learning (Bishop, 2006; Mitchell, 1997). Here, we introduce only a few examples. K-means clustering is a very simple unsupervised learning algorithm that can be used for image compression and image segmentation (Bishop, 2006). The general idea of K-means clustering is to partition the observations into clusters in which each observation belongs to the cluster with the nearest mean. K-means is a special case of the expectation-maximization (EM) algorithm; other EM algorithms have also been proposed for image reconstruction and image segmentation (Friedman & Russell, 1997; Zhang, Brady, & Smith, 2001). Independent component analysis (ICA) is another effective and widely used segmentation method for both audio and visual data (Jenssen & Eltoft, 2003; Lee & Lewicki, 2002). kNN and HMM algorithms can also be used for unsupervised learning tasks (Nefian & Hayes, 1998; Weinberger & Saul, 2006). For most of the above-mentioned unsupervised learning algorithms, the number of segments/clusters must be set before running the algorithm, based on the application; a poorly chosen number of segments may result in poor solutions for algorithms such as K-means and EM (Tatiraju & Mehta, 2008).
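The alternating assign/update loop of K-means can be written in a few lines of NumPy. The sketch below (illustrative only; the data and initialization are our own) compresses a synthetic "image" by clustering its pixel colors into k representative colors, exactly the image-compression use mentioned above.

```python
import numpy as np

def kmeans(points, k, iters=20):
    """Plain K-means: alternate assigning points to the nearest mean
    and recomputing each mean (a special case of EM)."""
    # Simple deterministic initialization: evenly spaced data points.
    # (k-means++ or random restarts are common in practice.)
    centers = points[np.linspace(0, len(points) - 1, k).astype(int)].copy()
    for _ in range(iters):
        # Assignment step: each point joins the cluster with the nearest mean.
        d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Update step: each mean moves to the centroid of its cluster.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return centers, labels

# A toy 'image': pixels drawn from three distinct RGB colors plus noise.
rng = np.random.default_rng(1)
base = np.array([[255, 0, 0], [0, 255, 0], [0, 0, 255]], dtype=float)
pixels = np.repeat(base, 100, axis=0) + rng.normal(0, 5, (300, 3))

centers, labels = kmeans(pixels, k=3)
compressed = centers[labels]   # every pixel replaced by its cluster color
```

Note that k is fixed in advance, which is the practical limitation discussed above.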
Reinforcement learning is concerned with how an agent learns from and interacts with a dynamic environment so as to maximize a numerical reward signal (Kaelbling, Littman, & Moore, 1996; Sutton & Barto, 1998). The difference between supervised/unsupervised learning and reinforcement learning is threefold. First, reinforcement learning requires one or multiple agents to take a sequence of steps, whereas supervised and unsupervised learning are one shot. Second, supervised learning offers complete information about the target variable and unsupervised learning offers no information, whereas reinforcement learning offers only a reward signal, for which the feedback is delayed, not instantaneous. Third, in reinforcement learning, the agent must actively interact with the dynamic environment, learning from trial-and-error interactions and taking actions, which is not the case in supervised or unsupervised learning. Reinforcement learning is also used in audio/visual data processing, but not as widely as
supervised and unsupervised learning thus far. For example, Asada, Noda, Tawaratsumida, and Hosoda (1996) proposed a method of vision-based reinforcement learning by which a robot learns to shoot a ball into a goal, and later Asada, Uchibe, and Hosoda (1999) extended this learning system and used it on multiple robots. Some scholars have applied reinforcement learning to object recognition (Paletta & Pinz, 2000; Peng & Bhanu, 1998) and speech recognition tasks (Lee & Mahajan, 1990; Singh, Kearns, Litman, & Walker, 1999).
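A minimal sketch of the trial-and-error loop is tabular Q-learning on a tiny corridor world (the environment and parameters below are invented purely for illustration): the agent only ever sees a delayed reward at the goal, yet the learned action values eventually point it in the right direction from every state.

```python
import random

N_STATES, GOAL = 5, 4        # states 0..4 in a corridor; reward only at state 4
ACTIONS = [-1, +1]           # move left or right
alpha, gamma, epsilon = 0.5, 0.9, 0.2

Q = [[0.0, 0.0] for _ in range(N_STATES)]   # Q[state][action index]
random.seed(0)

for _ in range(200):                        # episodes of trial and error
    s = 0
    while s != GOAL:
        # Epsilon-greedy: mostly exploit current estimates, sometimes explore.
        a = random.randrange(2) if random.random() < epsilon else Q[s].index(max(Q[s]))
        s2 = min(max(s + ACTIONS[a], 0), N_STATES - 1)
        r = 1.0 if s2 == GOAL else 0.0      # delayed reward: only at the goal
        # Q-learning update toward reward plus discounted best future value.
        Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
        s = s2

# The greedy policy after training: best action index in each nonterminal state.
policy = [Q[s].index(max(Q[s])) for s in range(N_STATES - 1)]
```

After training, the greedy policy moves right (action index 1) in every state, even though no state except the last ever produced a reward directly; this delayed-feedback credit assignment is exactly what distinguishes reinforcement learning from the supervised setting.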
Due to the widespread application of machine learning in both academic research and real-life practice, a large variety of software is available to implement machine learning algorithms, such as R, MATLAB, and OpenCV. R is an open source programming language and software environment originally developed for statistical computing and graphics. R contains many packages that can be used for machine learning implementation, such as the "nnet" package for neural network algorithms and "rpart" for decision trees. Several toolboxes in MATLAB can be used for machine learning, such as the Statistics Toolbox and the Neural Network Toolbox. OpenCV contains the Machine Learning Library (MLL), which was developed specifically for machine learning implementation.
OVERVIEW OF APPLICATIONS
In this section, we provide a brief review of applications that use either audio or visual data. As applications are by nature diverse, we do not attempt to be exhaustive. The objective is to describe some of the most common applications in practice and to inspire marketing scholars to use them. Existing applications using audio or visual data are discussed in the following subsections.
Applications Based on Audio Data
Research in voice/speech can be divided into two large subareas, depending on the aspect of speech on which the research focuses (Cowie et al., 2001). The first subarea focuses on linguistics: the language and the content being conveyed by the voice. Speech recognition (or speech-to-text) and keyword recognition belong to this field. The second subarea focuses on paralinguistics, or how words are spoken (i.e., the prosodic features,
which refer to the pitch, intensity, and temporal aspects of voice) without regard to content. Paralinguistics can be used to recognize people's traits and personalities, detect uncertainty and various emotions, and even capture deception. Both areas of voice/speech analysis bring various disciplines together, such as psychology, digital signal processing in electrical engineering, and computer science. We review literature in psychology, communications, and computer science, and discuss one example application using linguistics (speech recognition) and three example applications using paralinguistics (detection of uncertainty, emotion, and deception) (Table 2).
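To give a feel for what these prosodic features are numerically, the sketch below estimates two of them from a raw waveform with plain NumPy: pitch via the classic autocorrelation method (the lag at which the signal best matches a shifted copy of itself) and intensity as root-mean-square amplitude. The synthetic 220 Hz tone stands in for recorded speech; real systems use more robust estimators.

```python
import numpy as np

def estimate_pitch(y, sr, fmin=80.0, fmax=500.0):
    """Autocorrelation pitch estimate: find the lag, within the plausible
    human-voice range, at which the signal is most self-similar."""
    r = np.correlate(y, y, mode="full")[len(y) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)
    lag = lo + int(np.argmax(r[lo:hi]))
    return sr / lag

def intensity_rms(y):
    """Root-mean-square amplitude, a simple intensity measure."""
    return float(np.sqrt(np.mean(y ** 2)))

sr = 8000                                  # sampling rate in Hz
t = np.arange(sr) / sr                     # one second of samples
y = 0.5 * np.sin(2 * np.pi * 220.0 * t)    # synthetic 220 Hz 'voice'

pitch = estimate_pitch(y, sr)              # close to 220 Hz
level = intensity_rms(y)                   # amplitude 0.5 gives about 0.354
```

Temporal features (pause lengths, speaking rate) are extracted analogously from the time axis, for example by thresholding the frame-by-frame RMS level.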
Speech recognition is the most widely used application of audio data analysis. The goal of an automatic speech recognition (ASR) system is to convert speech data into text form. The ultimate goal is to perceive speech on par with a human listener, independent of the speaker and of conditional factors (e.g., background noise). The first speech recognition system was developed by researchers at AT&T Bell Labs. This system was able to detect the numerical digits 0 through 9 in English. Here, classification was dependent on the speaker; that is, the reference data for each number were collected from a particular speaker and later compared with that same speaker's speech data (Klevans & Rodman, 1997). It is difficult to measure progress in speech recognition performance because contexts, environments, and tasks vary dramatically. Commercially available software applications seem to perform fairly well in a speaker-dependent sense, after being "trained" by an individual to recognize his or her speech. However, if the
Table 2. Speech Applications.

Speech recognition
  Applications in marketing: smartphone applications
  Selected papers: Benzeghiba et al. (2007)

Uncertainty detection
  Applications in marketing: survey responses, interviews, focus groups
  Selected papers: Pon-Barry and Shieber (2011)

Emotion recognition
  Applications in marketing: call-center applications
  Selected papers: Ververidis and Kotropoulos (2006)

Deception detection
  Applications in marketing: survey responses, interviews, focus groups
  Selected papers: Hirschberg et al. (2005)

Methods used across these applications: human judgment; feature extraction (pitch, intensity, temporal features); classification.
application is used for a new speaker, performance is limited. A common example is automatic call center applications, which recognize words, digits, and short phrases with a fairly high error rate. A thorough review of the ASR literature can be found in Benzeghiba et al. (2007).
Unlike linguistic applications such as the speech recognition discussed above, paralinguistic applications are much more subtle, and their effectiveness varies widely. A relatively well-researched application in this area is the detection of uncertainty. In a study by Smith and Clark (1993), in which an experimenter asked participants general knowledge questions, the participants produced hedges and filled pauses and exhibited rising intonation contours when they had a lower feeling of knowing (FOK). Brennan and Williams (1995) examined whether listeners are sensitive to the filled pauses and prosody that speakers use to display their metacognitive states. For answers, rising intonation and longer latencies led to lower perceived feeling of another's knowing (FOAK) by listeners. Filled pauses led to lower ratings for answers and higher ratings for nonanswers ("I don't know/I can't remember") than did unfilled pauses.
Pon-Barry (2008) examined which prosodic features are associated with a speaker's level of certainty and where these prosodic manifestations occur relative to the location of the word or phrase about which the speaker is confident or uncertain. For whole utterances, temporal features (silence and duration) are most strongly associated with perceived level of uncertainty. Certain prosodic cues to uncertainty are localized in the target region (i.e., the word or words that a speaker is uncertain about), such as percent silence, while other prosodic cues are manifested in the surrounding context, such as pitch range. Other applications in the computer science and education literature aim to detect uncertainty in spoken dialogue computer tutor systems so machines can adapt to a user's state of uncertainty (Forbes-Riley & Litman, 2011; Xiong, Litman, & Marai, 2009).
Another relatively well-studied application is the detection of various emotions. This domain is closely aligned with affective computing, a rapidly growing field that aims to build machines that recognize, express, model, communicate, and respond to human emotions (Picard, 2003). The goal of emotional speech recognition is to teach a machine to detect human emotions based on voice.7 Dai, Fell, and MacAuslan (2009) used three perceptual dimensions for human judgment: valence, potency, and activation. The basic assumption is that emotions can be classified into discrete categories.
Research on emotional speech recognition is limited to certain emotions. The majority of emotional speech data collections include five or six
emotions, although in real life it is difficult to categorize them discretely. It is assumed that some basic emotions are more primitive and universal than others, such as anger, fear, sadness, sensory pleasure, amusement, satisfaction, contentment, excitement, disgust, contempt, pride, shame, guilt, embarrassment, and relief (Ekman, 1999). Murray and Arnott (1993) summarized prosodic and vocal features associated with various emotions. For example, anger and disgust can be contrasted: anger has a slightly faster speech rate, a much higher average pitch, higher intensity, and abrupt pitch changes with a breathy chest tone, whereas disgust has a very much slower speech rate, a much lower average pitch, lower intensity, and wide, downward pitch changes with a grumbled chest tone. Ang, Dhillon, Krupski, Shriberg, and Stolcke (2002) detected annoyance and frustration not only from prosodic features, which were computed automatically, but also from hand-marked speaking style features (such as hyperarticulation, pausing, or raised voice), and found that raised voice is the only meaningful predictor among the speaking style features.
Various contexts for detecting emotions have been studied. Real-life situations range from oral interviews of soldiers being evaluated for promotion in front of a board of superiors (Hansen, Kim, Rahurkar, Ruzanski, & Meyerhoff, 2011), and parents talking to infants to keep them away from dangerous situations (Slaney & McRoberts, 2003), to conversations between patients and therapists to detect depression and suicidal risk (France, Shiavi, Silverman, Silverman, & Wilkes, 2000). Ranganath, Jurafsky, and McFarland (2009) examined flirtation using prosodic and lexical features in the context of speed dates and found that humans are bad at detecting flirtation. Other studies examined speech in the context of various technologies. Call center contexts are commonly used in research (Burkhardt, Ajmera, Englert, Stegmann, & Burleson, 2006; Morrison et al., 2007). Lee and Narayanan (2005) used prosodic, lexical, and discourse information to detect emotions while a subject talked to a machine (e.g., during telephone calls to ASR call centers). Some contexts involved subjects, including children, who interacted with computer agents or robots (Batliner et al., 2004; Nakatsu et al., 2000). In others, subjects were studied in stressful situations (e.g., driving a car at various speeds while adding numbers) (Fernandez & Picard, 2003).
Another example of a paralinguistic application is detecting deception. We include it here to illustrate the fine line one must walk when applying paralinguistics in this way: sometimes the desired information may simply be too hard to infer from audio, as is often the case with deception. It is an area that has gained interest in criminology, communication, and other
fields, but has also sparked controversy as to its efficacy. A meta-analysis (Bond & DePaulo, 2006) of 206 studies shows that humans are relatively poor at detecting deception, performing at chance level on average. Thus, in contrast to speech recognition, emotion recognition, and uncertainty detection, in which human performance is often considered the highest baseline in evaluating machine performance, there are applications that attempt to detect deception better than the average human (Enos et al., 2006; Hirschberg et al., 2005).
Several studies have attempted to identify cues to deception. One paralinguistic feature that has been associated with deception is higher pitch (Ekman, O'Sullivan, Friesen, & Scherer, 1991), but this may be truer for people who are more motivated to deceive (Streeter, Krauss, Geller, Olson, & Apple, 1977). Also, Rockwell, Buller, and Burgoon (1997) reported that people who are lying produce shorter messages, longer response times, slower speaking rates, less fluency, increased intensity range and pitch variance, and a less pleasant vocal quality compared with people who are telling the truth. Anolli and Ciceri (1997) also reported higher pitch from liars, along with more pauses and words, eloquence, and disfluency.
Linguistic cues have also been studied. Newman, Pennebaker, Berry, and Richards (2003) applied a computer-based text analysis program (which analyzes text across 72 linguistic dimensions) to texts from five independent sample studies in various combinations. They reported that, compared to people who are telling the truth, liars show lower cognitive complexity, use fewer self-references and other-references, and use more words with negative valence. Other studies also report that deceiving people and truth-telling people adopt different word patterns (Burgoon & Qin, 2005; Zhou, Burgoon, Twitchell, Qin, & Nunamaker, 2004), which suggests that it may be useful to analyze linguistic content to detect deception. DePaulo et al. (2003) examined a total of 158 cues to deception in 120 independent samples and reported that 16 linguistic or paralinguistic cues appeared significant in multiple studies. However, no cue has been shown to reliably recognize deception across all contexts, subjects, and situations (Enos, 2009).
Another method of detecting deception that is neither linguistic nor paralinguistic is the voice stress analyzer (VSA), which has been commercially marketed. It is claimed to detect deception by capturing the "stress" inherent in lying through microtremors in the vocal cords. However, an independent study examining the effectiveness of this method failed to support these claims (Haddad, Walter, Ratley, & Smith, 2001).
Applications Using Visual Data
In this subsection, we discuss a few important applications that use visual data, recognized as part of the computer vision discipline. Applications using static data (images) involve detection of important human features such as the face, facial expression, iris, eye gaze, fingerprint, skin color, hand geometry, etc. Some applications involve more complex feature detection or recognition, such as gender classification, and some involve recognition of nonhuman objects, such as automobiles. Applications using dynamic data (videos) detect the movement of human features to recognize facial expressions or gestures, count people, and determine trajectories, among others.
The applications of static and dynamic data that we discuss here fall into two broad application areas, object recognition and object tracking, respectively. Both were introduced previously in the "Visual Data Processing" section. Although a large number of applications exist in each field, due to space constraints we restrict our discussion to the five applications most relevant to marketing practice: face recognition, facial expression recognition, eye gaze detection, gesture recognition, and human trajectory detection (Table 3).
Face recognition is "one of the most successful applications of image analysis and understanding" (Zhao et al., 2003, p. 400) and has received significant attention from both practitioners and academics. It is the process of automatically identifying a person from a digital image, which is typically done by comparing the focal face image with a large face database. Face recognition has been applied in many areas, such as suspect tracking and investigation in law enforcement, shoplifting prevention and advanced CCTV control in retail surveillance, video games, and, recently, face deals in marketing. Face deals is a Facebook-based application that relies on face recognition technology. Digital cameras equipped with face recognition devices are installed at local businesses. When a customer passes by, the cameras take pictures of the customer, and the face recognition device analyzes the customer's face and links it to his or her authorized Facebook account. The customer is then checked in at the location and, simultaneously, a customized deal based on the customer's Facebook "Like" history is sent to the customer's phone (http://redpepperland.com/lab/details/check-in-with-your-face).
Generally speaking, the process of automatic face recognition involves three major steps: face detection, feature extraction, and recognition/identification. A large body of algorithms has been developed for each step
Table 3. Image/Video Applications.

Face recognition
  Field: object recognition
  Applications in marketing: face deals
  Examples of implementation: feature extraction: eigenface, fisherface; classification: kNN, SVM
  Selected review papers: Zhao et al. (2003)

Facial expression recognition
  Field: object recognition; video tracking
  Applications in marketing: computerized sales assistant system
  Examples of implementation: feature extraction: eigenface, LBP; classification: HMM, neural networks
  Selected review papers: Fasel and Luettin (2003)

Eye gaze detection
  Field: object recognition; video tracking
  Applications in marketing: eye tracking
  Examples of implementation: feature extraction: eigeneye
  Selected review papers: Hansen and Ji (2010)

Gesture recognition
  Field: video tracking
  Applications in marketing: Kinect
  Examples of implementation: feature extraction: spatial, temporal, and 3D features; classification: HMM, FSM
  Selected review papers: Mitra and Acharya (2007)

Human trajectory detection
  Field: video tracking
  Applications in marketing: shoppers' trajectory detection; people counter
  Examples of implementation: feature extraction: edge, texture, etc.; classification: SVM, kNN
  Selected review papers: Chan et al. (2009); Trucco and Plakas (2006); Yilmaz et al. (2006)
(see Zhao et al., 2003 for a review). There are three main types of methods for automatic face detection: template matching; feature based (e.g., skin color); and image based, which trains machine systems on large numbers of samples (i.e., images labeled as face or nonface). Among these, image-based methods perform the best and are more capable of detecting multiple faces in a single image (Rowley, Baluja, & Kanade, 1998; Sung & Poggio, 1997). The human face is complex, so decomposing it into an effective set of features is critical to the ultimate success of face recognition. There are three main types of feature extraction methods: (a) generic methods based on edges, lines, and curves; (b) feature-template-based methods based on specific facial features such as the eyes and chin; and (c) structural matching methods that are holistic and consider geometrical constraints on the features. Here, we introduce two structural matching methods that have been demonstrated to be efficient: eigenface (Turk & Pentland, 1991) and fisherface (Belhumeur, Hespanha, & Kriegman, 1997). The basic idea of eigenface is to use principal component analysis (PCA) to project face images onto a set of dominant eigenvectors (for a detailed implementation, see Turk & Pentland, 1991). Because each dominant eigenvector looks roughly like a face, these eigenvectors are also called eigenfaces. Using the eigenface method, the features extracted from each face form a vector of eigenface loadings. The dissimilarity between two face images can be well represented by the difference between the two corresponding loading vectors. Eigenface is "the first successful demonstration of machine recognition of faces" (Zhao et al., 2003, p. 412). However, the eigenface method suffers from a drawback in that it is vulnerable to noise, such as variations in lighting and facial expression. To overcome this drawback, Belhumeur et al. (1997) proposed the fisherface method.
The fisherface method requires multiple images of the same face under various lighting and expression conditions, which are not required by the eigenface method. It is based on Fisher's linear discriminant, which maximizes the ratio of between-face variation to within-face variation. The last step, the recognition task, is a simple classification problem, and many supervised learning algorithms could be used to achieve the goal. For example, Turk and Pentland (1991) used kNN; Belhumeur et al. (1997) used SVM and achieved better classification accuracy than kNN. A face recognition system can easily be extended to gender classification (Moghaddam & Yang, 2002) by replacing the labels of the training images with gender (male or female) in the last step and then performing a binary classification.
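The eigenface pipeline, PCA projection followed by nearest-neighbor matching of loading vectors, can be sketched compactly. The "faces" below are tiny synthetic images invented purely for illustration; a real system would use aligned photographs and many more components.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for face images: 3 'people', each a distinct 8x8 pattern
# (flattened to 64 values), observed 5 times with noise (lighting variation, etc.).
prototypes = rng.normal(0, 1, (3, 64))
train = np.vstack([p + rng.normal(0, 0.1, (5, 64)) for p in prototypes])
train_ids = np.repeat(np.arange(3), 5)

# PCA via SVD of the mean-centered images: rows of Vt are the 'eigenfaces'.
mean_face = train.mean(axis=0)
U, S, Vt = np.linalg.svd(train - mean_face, full_matrices=False)
eigenfaces = Vt[:4]                         # keep the 4 dominant components

def loadings(img):
    """Project an image onto the eigenfaces to get its loading vector."""
    return eigenfaces @ (img - mean_face)

train_loadings = np.array([loadings(x) for x in train])

def recognize(img):
    """Nearest neighbor in loading space: smallest loading-vector distance."""
    d = np.linalg.norm(train_loadings - loadings(img), axis=1)
    return int(train_ids[np.argmin(d)])

probe = prototypes[2] + rng.normal(0, 0.1, 64)   # a new image of person 2
who = recognize(probe)
```

The loading-vector distance is exactly the dissimilarity measure described above; swapping the identity labels for gender labels turns the same pipeline into the gender classifier mentioned in the text.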
Facial expression recognition is the process of classifying facial motion and facial feature deformation into classes that are purely based on visual
information (Fasel & Luettin, 2003). Facial expression has gained attention in marketing practice because it is a good and natural indication of a customer's internal emotions and mental activities (Russell & Fernandez-Dols, 1997). Shergill, Sarrafzadeh, Diegel, and Shekar (2008) developed a computerized sales assistant system that depends heavily on facial expression recognition technology. The intelligent sales assistant automatically scans the store using a video camera, captures a shopper's face in an image, estimates the shopper's tendency to buy based on facial expression, and guides sales staff toward potential buyers while suggesting suitable sales strategies. This computerized sales assistant system helps sales personnel allocate their time and effort better, which results in increased sales and lower labor costs for retail stores while providing a more efficient shopping experience for customers. Facial expression recognition technology has also been applied to online shopping contexts in virtual stores (e.g., Raouzaiou, Tsapatsoulis, Tzouvaras, Stamou, & Kollias, 2002).
The three major steps involved in facial expression recognition are face detection, feature extraction, and expression classification. The first step is the same as the first step in the face recognition task. Feature extraction mainly involves three types of features: geometric based, appearance based, and a combination of both. Geometric-based features measure the displacements of certain face regions, such as the eyebrows or the corners of the mouth, while appearance-based features are concerned with face texture. Geometric-based approaches require reliable methods for point detection and tracking, which are usually difficult to obtain, so appearance-based features are much more commonly used in this step. Appearance-based features may be extracted either holistically or locally. Holistic features are determined by processing the face as a whole, using, for example, eigenface features (Abboud, Davoine, & Dang, 2004). In contrast, local features focus on specific facial features or areas that are prone to change with facial expressions, using, for example, local binary pattern (LBP) features for the eye and mouth regions (Shan, Gong, & McOwan, 2009). Expression classification is a supervised classification problem, basically assigning expression labels to focal face images. Ideally, a large set of training images should be collected for each facial expression of interest so as to create a sufficiently representative sample. A large variety of algorithms is available for this step (see Fasel & Luettin, 2003 for a review). HMMs and neural networks have been demonstrated to perform well in the literature (Ma & Khorasani, 2004; Oliver, Pentland, & Berard, 2000).
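To make the appearance-based route concrete, the sketch below computes a hand-rolled LBP histogram as the feature and trains a linear SVM as the expression classifier. The synthetic images, the "expression" being a patch of extra texture, and all sizes are fabricated assumptions for illustration only; they stand in for the aligned face crops a real system would use.

```python
import numpy as np
from sklearn.svm import LinearSVC

def lbp_histogram(img):
    """Minimal local-binary-pattern feature: compare each interior pixel with
    its 8 neighbours, pack the comparisons into an 8-bit code, and return the
    normalised 256-bin histogram of codes."""
    h, w = img.shape
    centre = img[1:-1, 1:-1]
    shifts = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    code = np.zeros_like(centre, dtype=np.uint16)
    for bit, (dy, dx) in enumerate(shifts):
        neigh = img[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        code |= ((neigh >= centre).astype(np.uint16) << bit)
    hist = np.bincount(code.ravel(), minlength=256).astype(float)
    return hist / hist.sum()

def make_face(expressive, rng):
    """Fabricated 24x24 'face': smooth shading, plus local texture if expressive."""
    yy, xx = np.mgrid[0:24, 0:24]
    img = (yy + xx).astype(float) + rng.normal(scale=0.1, size=(24, 24))
    if expressive:
        img += rng.normal(scale=5.0, size=(24, 24))   # appearance change
    return img

rng = np.random.default_rng(1)
labels = [0] * 30 + [1] * 30
X = np.array([lbp_histogram(make_face(e, rng)) for e in labels])
clf = LinearSVC().fit(X, labels)   # expression classification step
```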
An Introduction to Audio and Visual Research and Applications in Marketing 239

Currently, most facial expression recognition software applications (e.g., eMotions, FaceReader, and OKAO) focus on identifying six primary expressions, namely happiness, sadness, fear, disgust, surprise, and anger. Identified by Ekman and Friesen (1971), these six expressions are universal across ethnicities and cultures. Other expressions such as contempt and anxiety, which are included in Friesen and Ekman's (1983) emotional facial action coding system (EMFACS), can also be detected using a three-step approach similar to the one described above, but accuracy is not satisfactory because they are not easily distinguishable from a neutral expression. Generally speaking, more demonstrative facial expressions (e.g., happiness and surprise) can be recognized with reasonable accuracy more easily than subtle ones (e.g., contempt and anxiety).
Eye gaze detection is a vision-based method that tracks eye position and eye movement to estimate the point of gaze (Hansen & Ji, 2010). Eye tracking has been used extensively in consumer research in recent years, for applications ranging from packaging and copy testing to web usability (PRS Insights, 2012). Because businesses spend enormous amounts of money on advertising, they want to know whether consumers pay attention, how much attention is paid to their products and brands, whether their money is well spent, and how to improve their advertising campaigns. Eye tracking provides a real-time and reasonably good measure of attention while a consumer is browsing advertisements.
There are two main types of eye gaze detection solutions. The first requires special devices beyond video cameras. For example, an infrared light source shines infrared light into the eye, two feature areas (the corneal reflection and the pupil) are detected in the image obtained from a video camera, and the eye gaze direction is then determined from their relative positions. The second type works solely with video cameras, or even devices as simple as a webcam (Savas, 2008; Sewell & Komogortsev, 2010). The first type achieves better accuracy, but it is obtrusive (usually the infrared light source is installed in a headset) and expensive. The second is easy to use and unobtrusive, but sacrifices some detection precision and accuracy. Here, we focus on the second type.
A typical webcam solution may follow four general steps for eye gazedetection: face detection, eye detection, feature extraction, and eye gazeclassification (Savas, 2008). The face detection step is the same as in facerecognition and facial expression recognition. Eye detection can beaccomplished by searching for a specific oriented contrast between regions(e.g., Haar-like classifiers in OpenCV) (Savas, 2008), or by searching for aspecific blinking pattern (Hansen & Ji, 2010) on the face. In terms of method,the feature extraction step is also similar to the one used in facial expressionrecognition. However, facial expression recognition extracts features from the
face, whereas eye gaze detection extracts features particularly from the eyes.Many facial feature extraction methods can be extended to eye featureextraction with minimal modification. For example, the eigeneye method is asimple extension of the eigenface approach and works efficiently for eye gazedetection (Liu, Xu, & Fujimura, 2002). The last step is eye gaze classification,which is a supervised classification problem similar to expression classification.A large set of training eye images is needed to achieve reasonable accuracy(Savas, 2008; Sewell & Komogortsev, 2010).
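Under strong simplifying assumptions, the last three steps of the webcam pipeline can be caricatured in a few lines: locate the pupil as the centroid of dark pixels in a given eye region, use its normalised position as the feature, and classify gaze direction by nearest mean. The synthetic eye patches, the intensity threshold, and the three-way left/centre/right labelling are invented for illustration, not part of any cited system.

```python
import numpy as np

def pupil_feature(eye_img):
    """Feature extraction: normalised (x, y) centroid of dark (pupil) pixels."""
    ys, xs = np.nonzero(eye_img < 100)        # pupil pixels are much darker
    h, w = eye_img.shape
    return np.array([xs.mean() / w, ys.mean() / h])

def make_eye(direction, rng):
    """Fabricated 20x40 eye patch: bright sclera, dark pupil at a given spot."""
    img = 200.0 + rng.normal(scale=5.0, size=(20, 40))
    cx = {"left": 8, "centre": 20, "right": 32}[direction]
    img[6:14, cx - 4:cx + 4] = 20.0
    return img

rng = np.random.default_rng(2)
directions = ["left", "centre", "right"]
means = {d: np.mean([pupil_feature(make_eye(d, rng)) for _ in range(5)], axis=0)
         for d in directions}

def classify_gaze(eye_img):
    """Gaze classification: nearest mean feature vector among the directions."""
    f = pupil_feature(eye_img)
    return min(directions, key=lambda d: np.linalg.norm(f - means[d]))
```

A real system would feed richer features (e.g., eigeneye loadings) into a trained classifier rather than this nearest-mean rule.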
Gesture recognition is the process of ‘‘recognizing meaningful expressionsof motion by a human, involving the hands, arms, face, head, and/or body’’(Mitra & Acharya, 2007, p. 311). Gesture recognition enables a human tointerface with a machine efficiently, and thus has been widely applied inhuman–computer interaction related applications. A well-known applica-tion of vision-based gesture recognition is Microsoft Kinect for Xbox.Kinect uses an infrared projector and camera, and a special microchip totrack the movements of objects and individuals in three dimensions so thesystem can interpret specific gestures, enabling completely hands-freecontrol of electronic devices (Naone, 2011).
A gesture recognition system consists of three main steps: detection of specific parts (e.g., hand, arm, etc.), feature extraction, and gesture classification. Detection based on shape, color, and texture has been found effective for body parts such as hands and arms (Mitra & Acharya, 2007). ‘‘Selecting good features is crucial to gesture recognition, since hand gestures are very rich in shape variation, motion and textures’’ (Wu & Huang, 1999, p. 105). In the literature, spatial features, temporal features, and 3-D features have been proposed. Since a gesture can be modeled as an ordered sequence of states in a spatiotemporal configuration space, HMMs and finite state machines (FSMs) are efficient tools often used for gesture classification (Davis & Shah, 1994; Mitra & Acharya, 2007).
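The state-sequence view lends itself to a toy finite state machine. The sketch below accepts a "swipe right" gesture only when quantized hand x-positions pass through left, middle, and right states in order; the quantization thresholds and the gesture definition are invented for illustration (an HMM would replace these hard transitions with probabilistic ones).

```python
def quantize(x):
    """Map a normalised hand x-position to one of three spatial states."""
    return 0 if x < 0.33 else (1 if x < 0.66 else 2)

def is_swipe_right(xs):
    """FSM gesture classifier: accept only sequences that advance through the
    states left (0) -> middle (1) -> right (2) without moving backwards."""
    state = 0
    for x in xs:
        obs = quantize(x)
        if obs == state + 1:
            state += 1              # advance on the expected next state
        elif obs < state:
            return False            # regressing aborts the gesture
    return state == 2
```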
Human trajectory detection is the process of tracking human objects in avideo and then plotting their spatial locations as a function of time. This is astraightforward application of object tracking, basically detecting humanobjects in each frame of a video in an independent or dependent manner (seeTrucco & Plakas, 2006; Yilmaz, Javed, & Shah, 2006 for a review).Although implementation is not difficult, it has significant implications forretailing practice.
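Frame by frame, the independent-detection variant reduces to finding the foreground blob in each frame and recording its centroid over time. The binary frames below are a fabricated stand-in for background-subtracted surveillance video.

```python
import numpy as np

def detect_person(frame):
    """Independent per-frame detection: centroid (row, col) of foreground pixels."""
    ys, xs = np.nonzero(frame)
    return (ys.mean(), xs.mean())

# Fabricated background-subtracted frames: a 3x3 "shopper" walking rightwards.
frames = []
for t in range(5):
    f = np.zeros((20, 20), dtype=int)
    f[9:12, 2 + 3 * t:5 + 3 * t] = 1
    frames.append(f)

# The trajectory is the centroid plotted as a function of time (frame index).
trajectory = [detect_person(f) for f in frames]
```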
Tracking shopper’s moving trajectory is one important application ofhuman trajectory detection technology in retailing practice. Some retailersanalyze in-store surveillance video to track customer shopping paths. Comb-ining the shopping path information with store layouts, retailers can better
understand how customers move through and interact with the store. Bytracking customers’ movements, retailers can learn where they pause or movequickly, or where there is congestion. This information can help retailersimprove layouts and planograms (Levy & Weitz, 2008; Underhill, 2008).
One big challenge associated with shopper trajectory detection is simultaneously tracking multiple objects (i.e., shoppers) (Khan, Javed, Rasheed, & Shah, 2001). Most of the time, multiple customers are shopping in the store simultaneously, while new customers are entering and existing customers are completing their purchases and exiting the store. Also, one video camera provides only limited coverage; therefore, quite a few cameras must be installed at different locations and angles so as to capture different perspectives of a store. Establishing correspondence between objects captured by different cameras can be difficult (Khan et al., 2001).
Vision-based people counting solutions are another widely used application of human trajectory detection technology in the retail industry. Retailers count shoppers for many reasons (e.g., to calculate the purchase conversion rate or to manage staff shifts). People counters measure the number and direction of people traversing a certain passage or entrance per unit time. Such devices are often placed at store entrances or embedded into video surveillance systems. One big challenge associated with vision-based people counters is how to cope efficiently with high densities of people. Chan et al. (2009) proposed a crowd counting system based on Gaussian process regression over holistic features such as edges and texture. They empirically tested the proposed system and showed that it reduces the error rate to less than 20% under high-density conditions.
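In the spirit of Chan et al. (2009), the sketch below regresses the person count on two invented holistic summaries (an "edge" and a "texture" feature) with Gaussian process regression. The features are simulated rather than computed from video, so this only illustrates the regression step, not the full system.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

rng = np.random.default_rng(3)
counts = rng.integers(1, 40, size=60).astype(float)
# Simulated holistic features that grow roughly linearly with crowd size.
features = np.column_stack([
    2.0 * counts + rng.normal(scale=1.0, size=60),   # e.g., total edge length
    0.5 * counts + rng.normal(scale=0.5, size=60),   # e.g., texture energy
])

# alpha adds observation noise so the GP does not interpolate the data exactly.
gpr = GaussianProcessRegressor(alpha=1.0, normalize_y=True).fit(features, counts)
estimate = gpr.predict(np.array([[40.0, 10.0]]))[0]  # features of roughly 20 people
```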
AUDIO/VISUAL DATA IN RESEARCH IN MARKETING AND RELATED BUSINESS FIELDS
A growing number of studies in business, and specifically marketing, useaudio or visual data to help answer important questions in business researchand practice.
Research Based on Audio Data
Using the human voice to infer various emotional constructs has beenprevalent in other fields, but its application in marketing has been minimal.Brickman (1976, 1980) used voice analysis to determine positive/negative
responses to different product attributes, and predicted which consumersfrom a target group would be most likely to try a forthcoming product.Nelson and Schwartz (1979) applied their voice analysis methodology to testattitudinal scales, consumer interest in products, and advertising effective-ness. However, these researchers focused on one-dimensional constructs.Since the human voice is multidimensional, we may be able to obtain deeperinsights in addition to the response itself.
For example, in a survey, which is a commonly used marketing research method, a subject may not always be certain of his or her response to a question; generally, this aspect is not considered in practice. Kim, Shi, and Ding (2012) proposed the use of human speech as an alternative data input format and inferred uncertainty using various prosodic features extracted from respondents' voices. They found evidence that uncertainty inferred from speech can improve the accuracy of insights drawn from survey responses.
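As an illustration of the kind of prosodic features such a method might extract, the sketch below computes a pause ratio and a zero-crossing-rate spread from a synthetic utterance. The features, the energy threshold, and the link to uncertainty are illustrative assumptions, not the actual measures of Kim et al. (2012).

```python
import numpy as np

def prosodic_features(signal, frame_len=160):
    """Toy prosodic features from a mono signal: the fraction of low-energy
    (pause) frames, and the spread of the per-frame zero-crossing rate as a
    crude proxy for pitch variability."""
    n = len(signal) // frame_len * frame_len
    frames = signal[:n].reshape(-1, frame_len)
    energy = (frames ** 2).mean(axis=1)
    pause_ratio = float((energy < 0.1 * energy.max()).mean())
    zcr = (np.abs(np.diff(np.sign(frames), axis=1)) > 0).mean(axis=1)
    return {"pause_ratio": pause_ratio, "zcr_spread": float(zcr.std())}

# Synthetic "hesitant" answer at 16 kHz: a 120 Hz voiced tone with a pause.
t = np.arange(16000) / 16000.0
utterance = np.sin(2 * np.pi * 120.0 * t)
utterance[4000:8000] = 0.0          # quarter-second mid-utterance pause
feats = prosodic_features(utterance)
```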
In other business literature, Backhaus, Meyer, and Stockert (1985) usedvoice analysis to measure the activation component (in contrast to cognitivefactors) in the bargaining process of capital goods markets, where they showedthat voice pitch may be used as a valid activational indicator. Allmon andGrant (1990) used voice stress analysis to evaluate responses of real estatesalespeople to ethically based questions. Some respondents showed stress whilefollowing the ethical code guidelines, while others showed no stress aboutbreaking the formal code. Mayew and Venkatachalam (2012) measuredmanagerial affective states during earnings conference calls by analyzingconference call audio files using commercial emotional speech analysissoftware. They found evidence that when managers are scrutinized by analystsduring conference calls, their positive or negative affect provides informationabout the firm’s financial future. In another study, Hobson, Mayew, andVenkatachalam (2012) examined whether vocal markers of cognitivedissonance are useful for detecting financial misreporting. They used speechsamples of CEOs during earnings conference calls and generated vocaldissonance markers using automated vocal emotion analysis software. Theyfound that vocal dissonance markers are positively associated with thelikelihood of irregularity restatements made by CEOs.
Research Based on Video Data
Several researchers have acknowledged the potential of video data inmarketing research contexts. Hui, Fader, and Bradlow (2009a, 2009b)proposed using video technology together with RFID technology to detect
customer shopping paths and then linked the shopping paths to purchasingbehavior. Zhang, Li, and Burke (2012) tracked customer interactions withsalespeople and their companions along their shopping routes and studiedthe effect of such interactions on purchase tendencies. Valizade-Funder,Heil, and Jedidi (2012) proposed to embed gender classification into shoppertrajectory tracking so as to study how store promotions affect male andfemale shoppers differently.
Eye tracking data has been applied widely in visual attention research on print, TV, and online advertisements (see Wedel & Pieters, 2007 for a review). Video-based eye trackers use eye gaze detection technology to determine what people are looking at when they watch ads. Eye tracking data can also be used in a retail context to check the effectiveness of in-store and out-of-store marketing (e.g., Chandon, Hutchinson, Bradlow, & Young, 2009). Teixeira, Wedel, and Pieters (2012) studied how advertisers can leverage emotion and attention to engage consumers watching Internet video advertisements. In a controlled experiment, the authors assessed joy and surprise through automated facial expression detection for a sample of advertisements.
In addition to quantitative research, video data has also been used extensively in qualitative marketing research. Owing to its advantages in information richness, objectivity, and cost efficiency (Kirkup & Carrigan, 2000), videography often serves as a primary source of consumer data collection or as a supplement to other data collection methods such as surveys and direct observation (see Belk & Kozinets, 2005 for a review).
CONCLUSION
Compared with traditional data that are extensively used in marketingresearch such as purchase data and survey data, multimedia data providemuch richer information. In this chapter, we have discussed audio andvisual data, typical data analysis methods, applications in practice, andacademic literature in marketing and other related business disciplines. Wehope we have inspired scholars to utilize audio and visual data in marketingresearch. While the potential application areas are wide open, we want toconclude this chapter by proposing one type of application: AE.
We define AE as the ability of nonhuman models to predict a person’sinternal state (e.g., cognitive, affective, physical) given the signals he or sheemits (e.g., facial expression, voice, gesture) or to predict a person’s reaction(including, but not limited to internal states) when he or she is exposed to agiven set of stimuli (e.g., facial expression, voice, gesture, graphics, music, etc.).
In this chapter, we have discussed many examples that fall within the first part of AE. The inference of uncertainty from survey responses (Kim et al., 2012) is one such case. The second part of AE is less researched, and we illustrate it with an application to the analysis of print advertising. Xiao and Ding (2012) studied the effect of the facial features of human faces on people's reactions to print advertisements and the heterogeneity of that effect among people. To achieve this goal, they proposed a sequential approach. They first employed the eigenface method (Turk & Pentland, 1991) to decompose real faces and extract facial features (i.e., loadings on dominant eigenfaces), and analyzed the data based on the eigenface loadings. Next, they used the physiognomic method (Berry & McArthur, 1985; Cunningham, Barbee, & Pike, 1990), which represents each face using a set of facial distances, such as face height, eye length, and chin width, to interpret the results intuitively. They are the first to have introduced quantitative methods to study faces in a marketing context. Given that faces are heavily used in marketing practice (e.g., in advertising and through virtual agents), such a quantitative approach will hopefully encourage future face studies in marketing.
We hope this chapter will contribute to the broader adoption of audioand visual data research in marketing. Given the rich information containedin such data, availability of data, feasibility of storage, and computationalpower, we are confident that these data will contribute to better marketingpractices with the help of marketing scholars.
ACKNOWLEDGMENT
The authors thank Eelco Kappe, David Miller, and Robert Collins for theirhelpful comments.
NOTES
1. Text data is another interesting source of data. It can be extracted from audio or visual data. For example, speech recognition techniques can be used to extract text information from audio data (Furui, Kikuchi, Shinnaka, & Hori, 2004), while text recognition techniques are used to extract text information from image/video data (see Jung, Kim, & Jain, 2004 for a review). Besides being recognized from audio or visual sources, text data can also be acquired from various sources without recognition, for example, electronic documents, e-books, webpages, web blogs, and so forth. The methods for detecting useful patterns and trends in text data for decision-making purposes fall into a broad area called text mining, which is a subarea
of data mining and has much overlap with machine learning (Berry & Kogan, 2010). For example, Eliashberg, Hui, and Zhang (2007) applied text mining techniques to analyze movie scripts and forecast a movie's return on investment.
2. Machine vision is a field concerned with technology and methods that provide image-based inspection and analysis in industry (Steger, Ulrich, & Wiedemann, 2008). Its applications include automatic inspection, process control, and robot guidance, among others. Machine vision overlaps considerably with computer vision and is sometimes regarded as an application of computer vision to industrial tasks.
3. 1 byte = 8 bits; 1 kilobyte = 1,024 bytes; 1 megabyte = 1,024 kilobytes; 1 gigabyte = 1,024 megabytes; 1 terabyte = 1,024 gigabytes.
4. In general, two types of resolution are widely used in practice: pixel resolution and spatial resolution. Pixel resolution concerns the pixel count in digital imaging and is often used in the digital camera field. Spatial resolution refers to the number of independent pixel values per unit length; it concerns the smallest level of detail visible on the object of interest and thus matters more for computer vision than pixel resolution. Here we refer to pixel resolution.
5. There is one other type of machine learning, called semi-supervised learning, where the training data is partially labeled. In other words, only part of the training data (usually a very small part in practice) contains both input variables (X's) and corresponding target variables (y's), as in supervised learning, while the rest contains only input variables and no corresponding target variables, as in unsupervised learning. It falls between supervised learning and unsupervised learning (see Pise & Kulkarni, 2008).
6. Recommender systems are an important application of machine learning techniques, especially supervised learning, to audio and visual data (see Adomavicius & Tuzhilin, 2005 for a review). A famous recommender system for audio is the Pandora Internet Radio service, which selects and plays songs in real time to fit a specific user's preferences based on his/her positive or negative feedback on previously played songs (www.pandora.com). A famous application for visual data is the Netflix Prize, an open competition for the best machine learning algorithm to predict individual users' ratings for films based solely on their previous ratings (www.netflixprize.com). Recommender systems are an interesting area in themselves, but we do not discuss them in detail in the present chapter because they are not closely associated with audio/visual data processing.
7. Audio and visual data are sometimes integrated with text data (Qi et al., 2000). That is, multiple data modalities are considered simultaneously in making a decision, or decisions are made from the different modalities separately and fused afterwards (Chibelushi, Deravi, & Mason, 2002).
REFERENCES
Abboud, B., Davoine, F., & Dang, M. (2004). Facial expression recognition and synthesis based
on an appearance model. Signal Processing: Image Communication, 19(8), 723–740.
Adomavicius, G., & Tuzhilin, A. (2005). Toward the next generation of recommender systems:
A survey of the state-of-the-art and possible extensions. IEEE Transactions on
Knowledge and Data Engineering, 17(6), 734–749.
Allmon, D. E., & Grant, J. (1990). Real estate sales agents and the code of ethics: A voice stress
analysis. Journal of Business Ethics, 9, 807–812.
Ang, J., Dhillon, R., Krupski, A., Shriberg, E., & Stolcke, A. (2002). Prosody-based automatic
detection of annoyance and frustration in human–computer dialog. In Proceedings of the
international conference on spoken language processing, Denver, CO (pp. 2037–2040).
Anolli, L., & Ciceri, R. (1997). The voice of deception: Vocal strategies of naïve and able liars.
Journal of Nonverbal Behavior, 21(4), 259–284.
Asada, M., Noda, S., Tawaratsumida, S., & Hosoda, K. (1996). Purposive behavior
acquisition for a real robot by vision based reinforcement learning. Machine Learning,
23, 279–303.
Asada, M., Uchibe, E., & Hosoda, K. (1999). Cooperative behavior acquisition for mobile
robots in dynamically changing real worlds via vision-based reinforcement learning and
development. Artificial Intelligence, 110, 275–292.
Backhaus, K., Meyer, M., & Stockert, A. (1985). Using voice analysis for analyzing bargaining
processes in industrial marketing. Journal of Business Research, 13, 435–446.
Batliner, A., Hacker, C., Steidl, S., Noth, E., D’Arcy, S., Russell, M. J., & Wong, M. (2004).
You stupid tin box – Children interacting with the AIBO robot: A cross-linguistic
emotional speech corpus. In Proceedings of LREC European language resources
association.
BBC News. (2003). CCTV to drive down cab attacks. Retrieved from http://news.bbc.co.uk
Belhumeur, P. N., Hespanha, J. P., & Kriegman, D. J. (1997). Eigenfaces vs. fisherfaces:
Recognition using class specific linear projection. IEEE Transactions on Pattern Analysis
and Machine Intelligence, 19(7), 711–720.
Belk, R. W., & Kozinets, R. V. (2005). Videography in marketing and consumer research.
Qualitative Market Research: An International Journal, 8(2), 128–141.
Benzeghiba, M., De Mori, R., Deroo, O., Dupont, S., Erbes, T., Jouvet, D., … Wellekens, C.
(2007). Automatic speech recognition and speech variability: A review. Speech
Communication, 49, 763–786.
Berry, M. W., & Kogan, J. (2010). Text mining: Applications and theory. Chichester, UK: Wiley.
Berry, D., & McArthur, L. (1985). Some components and consequences of a babyface. Journal
of Personality and Social Psychology, 48(2), 312–323.
Bishop, C. M. (2006). Pattern recognition and machine learning. New York, NY: Springer.
Bond, C. F., Jr., & DePaulo, B. M. (2006). Accuracy of deception judgments. Personality and
Social Psychology Review, 10(3), 214–234.
Brennan, S. E., & Williams, M. (1995). The feeling of another’s knowing: Prosody and filled
pauses as cues to listeners about the metacognitive states of speakers. Journal of Memory
and Language, 34, 383–398.
Brickman, G. A. (1976). Voice analysis. Journal of Advertising Research, 16(3), 43–48.
Brickman, G. A. (1980). Uses of voice-pitch analysis. Journal of Advertising Research, 20(2),
69–73.
Burgoon, J. K., & Qin, T. (2005). The dynamic nature of deceptive verbal communication.
Journal of Language and Social Psychology, 25(1), 76–96.
Burke, R. R. (2005). The third wave of marketing intelligence. In M. Krafft & M. K. Mantrala
(Eds.), Retailing in the 21st century: Current and future trends (pp. 113–125). New York,
NY: Springer.
Burkhardt, F., Ajmera, J., Englert, R., Stegmann, J., & Burleson, W. (2006). Detecting anger in
automated voice portal dialogs. In Proceedings of interspeech.
Chan, A. B., Morrow, M., & Vasconcelos, N. (2009). Analysis of crowded scenes using holistic
properties. 11th IEEE international workshop on performance evaluation of tracking and
surveillance (pp. 101–108).
Chandon, P., Hutchinson, J. W., Bradlow, E. T., & Young, S. H. (2009). Does in-store
marketing work? Effects of the number and position of shelf facings on brand
attention and evaluation at the point of purchase. Journal of Marketing, 73(November),
1–17.
Chibelushi, C. C., Deravi, F., & Mason, J. S. D. (2002). A review of speech-based bimodal
recognition. IEEE Transactions on Multimedia, 4(1), 23–37.
Cowie, R., Douglas-Cowie, E., Tsapatsoulis, N., Votsis, G., Kollias, S., Fellenz, W., & Taylor, J.
(2001). Emotion recognition in human-computer interaction. IEEE Signal Processing
Magazine, 18(1), 32–80.
Crystal, D. (1966). The linguistic status of prosodic and paralinguistic features. Proceedings of
the University of Newcastle-upon Tyne Philosophical Society, 1(8), 93–108.
Cunningham, M. R., Barbee, A. P., & Pike, C. L. (1990). What do women want? Facialmetric
assessment of multiple motives in the perception of male facial physical attractiveness.
Journal of Personality and Social Psychology, 59(1), 61–72.
Dai, K., Fell, H., & MacAuslan, J. (2009). Comparing emotions using acoustics and human
perceptual dimensions. Proceedings of the 27th international conference extended
abstracts on Human factors in computing systems.
Davis, J., & Shah, M. (1994). Visual gesture recognition. Vision, Image and Signal Processing,
141, 101–106.
DePaulo, B. M., Lindsay, J. J., Malone, B. E., Muhlenbruck, L., Charlton, K., & Cooper, H.
(2003). Cues to deception. Psychological Bulletin, 129(1), 74–118.
Ekman, P. (1999). Basic emotions. In T. Dalgleish & M. J. Power (Eds.), Handbook of cognition
and emotion (pp. 45–60). New York, NY: Wiley.
Ekman, P., & Friesen, W. V. (1971). Constants across cultures in the face and emotion. Journal
of Personality and Social Psychology, 17(2), 124–129.
Ekman, P., O’Sullivan, M., Friesen, W. V., & Scherer, K. R. (1991). Face, voice, and body in
detecting deceit. Journal of Nonverbal Behavior, 15(2), 125–135.
Eliashberg, J., Hui, S. K., & Zhang, Z. J. (2007). From story line to box office: A new approach
for green-lighting movie scripts. Management Science, 53(6), 881–893.
Enos, F. (2009). Detecting deception in speech. Ph.D. dissertation, Columbia University,
New York, NY.
Enos, F., Benus, S., Cautin, R. L., Graciarena, M., Hirschberg, J., & Shriberg, E. (2006).
Personality factors in human deception detection: Comparing human to machine
performance. Proceedings of interspeech (pp. 813–816).
Fasel, B., & Luettin, J. (2003). Automatic facial expression analysis: A survey. Pattern
Recognition, 36, 259–275.
Fergus, R., Perona, P., & Zisserman, A. (2003). Object class recognition by unsupervised scale-
invariant learning. Proceedings in IEEE computer society conference on computer vision
and pattern recognition (Vol. 2, pp. 264–271).
Fernandez, R., & Picard, R. W. (2003). Modeling drivers’ speech under stress. Speech
Communication, 40, 145–159.
Forbes-Riley, K., & Litman, D. (2011). Benefits and challenges of real-time uncertainty
detection and adaptation in a spoken dialogue computer tutor. Speech Communication,
53(9–10), 1115–1136.
France, D. J., Shiavi, R. G., Silverman, S., Silverman, M., & Wilkes, D. M. (2000). Acoustical
properties of speech as indicators of depression and suicidal risk. IEEE Transactions on
Biomedical Engineering, 47(7), 829–837.
Friedman, N., & Russell, S. (1997). Image segmentation in video sequences: A probabilistic
approach. In Proceedings of the 13th conference on uncertainty in artificial intelligence
(UAI) (pp. 175–181).
Friesen, W. V., & Ekman, P. (1983). EMFACS-7: Emotional facial action coding system.
Unpublished manual, University of California.
Furui, S., Kikuchi, T., Shinnaka, Y., & Hori, C. (2004). Speech-to-text and speech-to-speech
summarization of spontaneous speech. IEEE Transactions on Speech and Audio
Processing, 12(4), 401–408.
Ganapathiraju, A., Hamaker, J., & Picone, J. (2004). Applications of support vector machines
to speech recognition. IEEE Transactions on Signal Processing, 52(8), 2348–2355.
Gish, H., Siu, M.-H., & Rohlicek, R. (1991). Segregation of speakers for speech recognition and
speaker identification. International conference on acoustics, speech, and signal processing
(Vol. 2, pp. 873–876).
Gonzalez, R. C., & Woods, R. E. (2008). Digital image processing (3rd ed.). Upper Saddle
River, NJ: Pearson Education, Inc.
Haddad, D., Walter, S., Ratley, R., & Smith, M. (2001). Investigation and evaluation of voice
stress analysis technology. Technical Report. National Criminal Justice Reference
Service.
Hand, D. J. (1998). Data mining: Statistics and more? The American Statistician, 52(2),
112–118.
Hansen, D. W., & Ji, Q. (2010). In the eye of the beholder: A survey of models for eyes and
gaze. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(3), 478–500.
Hansen, J. H. L., Kim, W., Rahurkar, M., Ruzanski, E., & Meyerhoff, J. (2011). Robust
emotional stressed speech detection using weighted frequency subbands. EURASIP
Journal on Advances in Signal Processing, 2011, 1–10.
Hirschberg, J., Benus, S., Brenier, J. M., Enos, F., Friedman, S., Gilman, S., Girand, C.,
Graciarena, M., Kathol, A., Michaelis, L., Pellom, B., Shriberg, E., & Stolcke, A. (2005).
Distinguishing deceptive from non-deceptive speech. Proceedings in interspeech (pp.
1833–1836).
Hobson, J. L., Mayew, W. J., & Venkatachalam, M. (2012). Analyzing speech to detect
financial misreporting. Journal of Accounting Research, 50(2), 349–392.
Hui, S. K., Fader, P. S., & Bradlow, E. T. (2009a). Path data in marketing: An integrative
framework and prospectus for model-building. Marketing Science, 28(2), 320–335.
Hui, S. K., Fader, P. S., & Bradlow, E. T. (2009b). The traveling salesman goes shopping: The
systematic deviations of grocery paths from TSP-optimality. Marketing Science, 28(3),
566–572.
Jenssen, R., & Eltoft, T. (2003). Independent component analysis for texture segmentation.
Pattern Recognition, 36(10), 2301–2315.
Juang, B. H., & Rabiner, L. R. (1991). Hidden Markov models for speech recognition.
Technometrics, 33(3), 251–272.
Jung, K., Kim, K. I., & Jain, A. K. (2004). Text information extraction in images and video: A
survey. Pattern Recognition, 37(5), 977–997.
Kaelbling, L. P., Littman, M. L., & Moore, A. W. (1996). Reinforcement learning: A survey.
Journal of Artificial Intelligence Research, 4, 237–285.
Khan, S., Javed, O., Rasheed, Z., & Shah, M. (2001). Human tracking in multiple
cameras. In Proceedings of IEEE international conference on computer vision (ICCV)
(Vol. 1, pp. 331–336).
Kim, H.-J., Shi, H., & Ding, M. (2012). Improving survey response accuracy through inferred
uncertainty from voice. Working Paper.
Kirkup, M., & Carrigan, M. (2000). Video surveillance research in retailing: Ethical issues.
International Journal of Retail & Distribution Management, 28(11), 470–480.
An Introduction to Audio and Visual Research and Applications in Marketing 249
© Malhotra, Naresh K., Jun 24, 2013, Review of Marketing Research, Emerald Group Publishing Limited, Bradford, ISBN: 9781781907610
Klevans, R. L., & Rodman, R. D. (1997). Voice recognition. Boston, MA: Artech House.
Lee, T.-W., & Lewicki, M. S. (2002). Unsupervised image classification, segmentation, and
enhancement using ICA mixture models. IEEE Transactions on Image Processing, 11(3),
270–279.
Lee, K.-F., & Mahajan, S. (1990). Corrective and reinforcement learning for speaker-
independent continuous speech recognition. Computer Speech & Language, 4(3),
231–245.
Lee, C. M., & Narayanan, S. S. (2005). Toward detecting emotions in spoken dialogs. IEEE
Transactions on Speech and Audio Processing, 13(2), 293–303.
Lehmann, T. M., Guld, M. O., Deselaers, T., Keysers, D., Schubert, H., Spitzer, K., … Wein,
B. B. (2005). Automatic categorization of medical images for content-based retrieval and
data mining. Computerized Medical Imaging and Graphics, 29, 143–155.
Levy, M., & Weitz, B. (2008). Retailing management (7th ed.). Irwin, CA: McGraw-Hill.
Litman, D., & Forbes, K. (2003). Recognizing emotions from student speech in tutoring
dialogues. Proceedings of IEEE automatic speech recognition and understanding workshop
(ASRU) (pp. 25–30).
Liu, C.-L., Nakashima, K., Sako, H., & Fujisawa, H. (2003). Handwritten digit
recognition: Benchmarking of state-of-the-art techniques. Pattern Recognition, 36(10),
2271–2285.
Liu, X., Xu, F., & Fujimura, K. (2002). Real-time eye detection and tracking for driver
observation under various light conditions. In Proceedings of IEEE intelligent vehicle
symposium (pp. 344–351).
Loy, G., & Eklundh, J.-O. (2006). Detecting symmetry and symmetric constellations of features.
Proceedings of ECCV, 2, 508–521.
Ma, L., & Khorasani, K. (2004). Facial expression recognition using constructive feed-
forward neural networks. IEEE Transactions on Systems, Man, and Cybernetics, 34(3),
1588–1595.
Mayew, W. J., & Venkatachalam, M. (2012). The power of voice: Managerial affective states
and future firm performance. The Journal of Finance, 67(1), 1–43.
Mitchell, T. M. (1997). Machine learning. MIT Press and McGraw-Hill Companies, Inc.
Mitchell, T. M. (2006). The discipline of machine learning. Technical Report No. CMUML-06-
108. Carnegie Mellon University.
Mitra, S., & Acharya, T. (2007). Gesture recognition: A survey. IEEE Transactions on Systems,
Man, and Cybernetics – Part C: Applications and Reviews, 37(3), 311–324.
Moghaddam, B., & Yang, M.-H. (2002). Learning gender with support faces. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 24(5), 707–711.
Morrison, D., Wang, R., & De Silva, L. C. (2007). Ensemble methods for spoken emotion
recognition in call-centres. Speech Communication, 49, 98–112.
Murray, I. R., & Arnott, J. L. (1993). Toward the simulation of emotion in synthetic speech: A
review of the literature on human vocal emotion. Journal of the Acoustical Society of
America, 93(2), 1097–1108.
Murthy, S. K. (1998). Automatic construction of decision trees from data: A multi-disciplinary
survey. Data Mining and Knowledge Discovery, 2(4), 345–389.
Nakatsu, R., Nicholson, J., & Tosa, N. (2000). Emotion recognition and its application to
computer agents with spontaneous interactive capabilities. Knowledge-Based Systems,
13, 497–504.
Naone, E. (2011). Microsoft Kinect: How the device can respond to your voice and gestures. MIT
Technology Review (January/February). Retrieved from www.technologyreview.com
LI XIAO ET AL. 250
Nefian, A. V., & Hayes, M. H., III (1998). Hidden Markov models for face recognition.
Proceedings of the IEEE international conference on acoustics, speech and signal
processing (Vol. 5, pp. 2721–2724).
Nelson, R. G., & Schwartz, D. (1979). Voice-pitch analysis. Journal of Advertising Research,
55–59.
Newman, M. L., Pennebaker, J. W., Berry, D. S., & Richards, J. M. (2003). Lying words:
Predicting deception from linguistic styles. Personality and Social Psychology Bulletin,
29(5), 665–675.
Nwe, T. L., Foo, S. W., & De Silva, L. C. (2003). Speech emotion recognition using hidden
Markov models. Speech Communication, 41(4), 603–623.
Oliver, N., Pentland, A., & Berard, F. (2000). LAFTER: A real-time face and lips tracker with
facial expression recognition. Pattern Recognition, 33, 1369–1382.
Paletta, L., & Pinz, A. (2000). Active object recognition by view integration and reinforcement
learning. Robotics and Autonomous Systems, 31, 71–86.
Peng, J., & Bhanu, B. (1998). Closed-loop object recognition using reinforcement learning.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(2), 139–154.
Picard, R. (2003). Affective computing: Challenges. International Journal of Human-Computer
Studies, 59, 55–64.
Pise, N. N., & Kulkarni, P. (2008). A survey of semi-supervised learning methods. In
Proceedings for international conference on computational intelligence and security (Vol. 2,
pp. 30–34).
Pomerleau, D. A. (1993). Knowledge-based training of artificial neural networks for
autonomous robot driving. In J. H. Connell & S. Mahadevan (Eds.), Robot learning,
(233, pp. 19–43).
Pon-Barry, H. (2008, September). Prosodic manifestations of confidence and uncertainty in
spoken language. Proceedings of Interspeech (pp. 74–77).
Pon-Barry, H., & Shieber, S. M. (2011). Recognizing uncertainty in speech. EURASIP Journal
on Advances in Signal Processing, Special Issue on Emotion and Mental State
Recognition from Speech.
PRS Insights. (2012). Getting the most from eye-tracking. Retrieved from www.prsresearch.com
Qi, W., Gu, L., Jiang, H., Chen, X.-R., & Zhang, H.-J. (2000). Integrating visual, audio, and
text analysis for news video. Proceedings of international conference on image processing
(pp. 10–13).
Ranganath, R., Jurafsky, D., & McFarland, D. (2009). It’s not you it’s me: Detecting flirting
and its misperception in speed-dates. In Proceedings of the conference on empirical
methods in natural language processing (EMNLP ’09) (pp. 334–342).
Raouzaiou, A., Tsapatsoulis, N., Tzouvaras, V., Stamou, G., & Kollias, S. D. (2002). A hybrid
intelligence system for facial expression recognition. In Proceedings of European
symposium on intelligent technologies, hybrid systems and their implementation on smart
adaptive systems (pp. 482–490).
Richmond, S. (2012). YouTube users uploading two days of video every minute. The Daily
Telegraph (London). Retrieved from www.telegraph.co.uk
Roberts, L. (2012). The history of video surveillance – from VCRs to eyes in the sky.
Retrieved from http://www.wecusurveillance.com/cctvhistory
Rockwell, P., Buller, D. B., & Burgoon, J. K. (1997). The voice of deceit: Refining and
expanding vocal cues to deception. Communication Research Reports, 14(4), 451–459.
Rowley, H. A., Baluja, S., & Kanade, T. (1998). Neural network based face detection. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 20(1), 23–38.
Russell, J. A., & Fernandez-Dols, J. M. (1997). The psychology of facial expression. Cambridge,
UK: Cambridge University Press.
Sajda, P. (2006). Machine learning for detection and diagnosis of disease. Annual Review of
Biomedical Engineering, 8, 537–565.
Samuel, A. (1959). Some studies in machine learning using the game of checkers. IBM Journal,
3(3), 210–229.
Savas, Z. (2008). TrackEye: Real-time tracking of human eyes using a webcam. Retrieved from
http://www.codeproject.com
Schuller, B., Muller, R., Eyben, F., Gast, J., Hornler, B., Wollmer, M., … Konosu, H. (2009).
Being bored? Recognising natural interest by extensive audiovisual integration for real-
life application. Image and Vision Computing, 27, 1760–1774.
Schuller, B., Batliner, A., Steidl, S., & Seppi, D. (2011). Recognising realistic emotions and affect
in speech: State of the art and lessons learnt from the first challenge. Speech
Communication, 53(9–10), 1062–1087.
Sewell, W., & Komogortsev, O. (2010). Real-time eye gaze tracking with an unmodified
commodity webcam employing a neural network. Proceedings of the 28th of the
international conference extended abstracts on human factors in computing systems
(pp. 3739–3744).
Shami, M., & Verhelst, W. (2007). An evaluation of the robustness of existing supervised
machine learning approaches to the classification of emotions in speech. Speech
Communication, 49, 201–212.
Shan, C., Gong, S., & McOwan, P. W. (2009). Facial expression recognition based on local
binary patterns: A comprehensive study. Image and Vision Computing, 27, 803–816.
Shapiro, L. G., & Stockman, G. C. (2001). Computer vision. Prentice Hall.
Shergill, G. S., Sarrafzadeh, A., Diegel, O., & Shekar, A. (2008). Computerized sales assistants:
The application of computer technology to measure consumer interest-a conceptual
framework. Journal of Electronic Commerce Research, 9(2), 176–191.
Shi, J., & Malik, J. (2000). Normalized cuts and image segmentation. IEEE Transactions on
Pattern Analysis and Machine Intelligence, 22(8), 888–905.
Singh, S., Kearns, M., Litman, D., & Walker, M. (1999). Reinforcement learning for spoken
dialogue systems. In Proceedings of NIPS, Denver, CO.
Slaney, M., & McRoberts, G. (2003). Baby ears: A recognition system for affective
vocalizations. Speech Communication, 39(3/4), 367–384.
Smith, V. L., & Clark, H. H. (1993). On the course of answering questions. Journal of Memory
and Language, 32(1), 25–38.
Steger, C., Ulrich, M., & Wiedemann, C. (2008). Machine vision algorithms and applications.
Weinheim: Wiley-VCH.
Streeter, L. A., Krauss, R. M., Geller, V., Olson, C., & Apple, W. (1977). Pitch changes during
attempted deception. Journal of Personality and Social Psychology, 35(5), 345–350.
Sung, K. K., & Poggio, T. (1997). Example-based learning for view-based human face
detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(1),
39–51.
Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. Cambridge, MA:
MIT Press.
Szeliski, R. (2011). Computer vision: Algorithms and applications. London: Springer.
Tatiraju, S., & Mehta, A. (2008). Image segmentation using k-means clustering, EM and
normalized cuts. Irvine, CA: University Of California.
Teixeira, T., Wedel, M., & Pieters, R. (2012). Emotion-induced engagement in internet video
advertisements. Journal of Marketing Research, 49(2), 144–159.
Trucco, E., & Plakas, K. (2006). Video tracking: A concise survey. IEEE Journal of Oceanic
Engineering, 31(2), 520–529.
Turk, M., & Pentland, A. (1991). Eigenfaces for recognition. Journal of Cognitive
Neuroscience, 3(1), 71–86.
Underhill, P. (2008). Why we buy: The science of shopping–updated and revised for the internet,
the global consumer, and beyond. New York, NY: Simon and Schuster Paperbacks.
Valizade-Funder, S., Heil, O., & Jedidi, K. (2012). Impact of retailer promotions on store traffic -
a video-based technology. Working Paper.
Ververidis, D., & Kotropoulos, C. (2006). Emotional speech recognition: Resources, features,
and methods. Speech Communication, 48, 1162–1181.
Wall Street Journal. (2007). Talk is cheap in politics, But a deep voice helps – Vocal experts
measure candidates’ likability; Toting up ‘Ums’ and ‘Ahs’. Wall Street Journal,
November 3.
Wedel, M., & Pieters, R. (2007). A review of eye-tracking research in marketing. In
N. K. Malhotra (Ed.), Review of marketing research, (4, pp. 123–147).
Weinberger, K. Q., & Saul, L. K. (2006). Unsupervised learning of image manifolds by
semidefinite programming. International Journal of Computer Vision, 70(1), 77–90.
Wu, Y., & Huang, T. S. (1999). Vision-based gesture recognition: A review. Lecture Notes in
Computer Science, 1739, 103–115.
Xiao, L., & Ding, M. (2012). Just the faces: Explore the effects of facial features in print
advertising. Working Paper.
Xiong, W., Litman, D. J., & Marai, G. E. (2009). Analyzing prosodic features and student
uncertainty using visualization. Cognitive and Metacognitive Educational Systems:
Papers from the AAAI Fall Symposium (pp. 93–98).
Yilmaz, A., Javed, O., & Shah, M. (2006). Object tracking: A survey. ACM Computing Surveys,
38(4). Article 13.
Yu, F., Chang, E., Xu, Y.-Q., & Shum, H.-Y. (2001). Emotion detection from speech to enrich
multimedia content. Proceedings of the second IEEE pacific rim conference on
multimedia: Advances in multimedia information processing (pp. 550–557).
Zhang, G. P. (2000). Neural networks for classification: A survey. IEEE Transactions on
Systems, Man, and Cybernetics, 30(4), 451–462.
Zhang, Y., Brady, M., & Smith, S. (2001). Segmentation of brain MR images through a hidden
Markov random field model and the expectation-maximization algorithm. IEEE
Transactions on Medical Imaging, 20(1), 45–57.
Zhang, H., Fritts, J. E., & Goldman, S. A. (2008). Image segmentation evaluation: A survey of
unsupervised methods. Computer Vision and Image Understanding, 110, 260–280.
Zhang, X., Li, S., & Burke, R. (2012). Modeling the dynamic influence of group interaction and
the store environment on shopper preferences and purchase behavior. Working Paper.
Zhao, W., Chellappa, R., Phillips, P. J., & Rosenfeld, A. (2003). Face recognition: A literature
survey. ACM Computing Surveys, 35(4), 399–458.
Zhou, L., Burgoon, J. K., Twitchell, D. P., Qin, T., & Nunamaker, J. F., Jr. (2004). A
comparison of classification methods for predicting deception in computer-mediated
communication. Journal of Management Information Systems, 20(4), 139–165.