Musicians rock on short-term memory and multisensory...


Musicians rock on short-term memory

and multisensory integration

Avigael M. Aizenman, Jason M. Gold* & Robert Sekuler

Brandeis University & Indiana University*

Supported by CELEST, an NSF Science of Learning Center (SBE-035478), National Institutes of Health grant EY-019265, and by AFOSR grant FA9550-10-1-0420. We thank Trevor Agus, Barbara Shinn-Cunningham and Randolph Blake for valuable comments on earlier versions of this paper, and Abigail Noyce and Arielle Keller for their assistance. A version of this work was presented to the 2013 meeting of the Vision Sciences Society. e-mail: [email protected]


Abstract

Musicians may have very good memory for sounds, but does that ability extend to other kinds of stimuli as well? For answers, we assessed musicians’ and non-musicians’ short-term memory for rapidly-presented, quasi-random sequences whose components varied in luminance (visual stimuli), or frequency (auditory stimuli), or both (audiovisual stimuli). In all cases, subjects judged whether a sequence’s last four items replicated its first four. For some audiovisual sequences, the frequency of each auditory item was monotonically related to the accompanying visual item’s luminance; for other audiovisual sequences, frequency and luminance were uncorrelated. Subjects with prior instrumental training significantly outperformed their untrained counterparts on both auditory and visual sequences, and on sequences of correlated auditory and visual items. Reverse correlation analysis revealed that the correlated, concurrent auditory stream altered how subjects weighted items at particular ordinal positions in a sequence. Finally, congruence between auditory and visual items enabled subjects to perform far better than predicted from simple summation of information from the two modalities, perhaps by engaging special-purpose mechanisms sensitive to audiovisual correlation.

Keywords

Multisensory, short-term memory, audiovisual integration, modality-appropriateness


Music-related skills are enhanced in people who have been trained to play an instrument (Hyde et al., 2009; Kraus & Chandrasekaran, 2010). Surprisingly, this effect includes superior performance on tasks with little obvious connection to music (Chan, Ho, & Cheung, 1998; Strait, Parbery-Clark, Hittner, & Kraus, 2012; Francois & Schön, 2011; Oxenham, Fligor, Mason, & Kidd, 2003; Bergstrom, Howard, & Howard, 2012). Various examples of cross-talk between auditory and visual processing (e.g., Sekuler, Sekuler, & Lau, 1997; Guttman, Gilroy, & Blake, 2005; Berger & Ehrsson, 2013) led us to hypothesize that musicians, who have had extensive practice with auditory tasks, might also demonstrate superior visual processing if tested with an appropriate task.

The selection of an appropriate task took account of Welch and Warren’s (1980) “modality-appropriateness” conjecture. Specifically, this conjecture asserts that when visual and auditory processing are compared, the advantage goes to vision when spatial attributes must be processed, but the advantage shifts to audition when temporal attributes are critical (Welch, 1999; Guttman et al., 2005). The modality-appropriateness conjecture is supported by recent functional magnetic resonance imaging (fMRI) results. Michalka, Rosen, Kong, Shinn-Cunningham, and Somers (2012) showed that task demands can dynamically recruit different modality-related frontal lobe regions: a visual task entailing rapid stimulus presentation activates cortical regions normally implicated in auditory attention, but an auditory task requiring spatial judgements activates regions normally implicated in visual attention. Together with the modality-appropriateness conjecture, Michalka et al.’s results suggest that evidence of musicians’ possible superiority in visual processing would depend upon the temporal characteristics of any test.

For our test, we chose a paradigm recently introduced by Gold, Aizenman, Bond, and Sekuler (2013). Building on a paradigm that Agus, Thorpe, and Pressnitzer (2010) devised for the study of auditory memory, Gold et al. showed subjects rapidly-presented sequences of quasi-random luminance levels, and asked them to judge whether the second four luminance levels in each eight-item sequence identically repeated the first four. Stimuli in their experiments entailed a sequence of rapid variation along what have been described as “elemental” or “low-level” sensory dimensions (Magnussen, 2000; Pasternak & Greenlee, 2005). Sequences of low-level sensory attributes afford useful experimental probes, in part because they reduce the likelihood that subjects’ performance would be mediated by verbal labels (Miller & Gazzaniga, 1998; Kahana & Sekuler, 2002). For the present purposes, such sequences offered another potential advantage. Although subjects’ self-reports are hardly dispositive (Nisbett & Wilson, 1977), some of Gold et al.’s subjects volunteered that as they observed the visual sequences, they generated subvocal “tunes.” In other words, they claimed to have recruited auditory imagery for what nominally was a purely visual task (Berger & Ehrsson, 2013), suggesting a form of cross-talk between modalities that Guttman et al. (2005) described as “hearing what the eyes see.”

Various lines of evidence for cross-talk between seeing and hearing led us to ask whether musicianship brought enhanced processing of rapidly-presented visual stimuli. For an answer, we adapted Gold et al.’s paradigm in order to compare music-trained and non-trained subjects with rapidly-presented stimulus sequences in which luminance levels or auditory frequencies varied (Rammsayer & Altenmüller, 2006). Finally, as many ordinary events generate multisensory signals, and the confluence of signals from multiple senses can powerfully influence perception (e.g., Thomas, 1941; Chen & Spence, 2010), we adapted the paradigm in order to test subjects with multisensory sequences whose co-occurring audio and visual components were either perceptually congruent or perceptually incongruent.

Method

In all of our test conditions, subjects had to judge whether the last four items in a rapidly-presented sequence of eight items did or did not repeat the first four. Figure 1 shows schematic examples of our unimodal stimuli, with Visual stimuli in Panel A and Auditory stimuli in Panel B. Items of each type were drawn from a homogeneous pool and were devoid of semantic content.

As in Gold et al. (2013), visual stimuli were presented against a uniform background of average luminance 19.03 cd/m² on a 17” CRT monitor (Sony Trinitron UltraScan P780)


[Figure 1 image not reproduced. Panel A: Sample Visual Trials; Panel B: Sample Auditory Trials. Rows show trial type (RN, N); columns show one trial and another trial, each lasting about 1 sec.]

Figure 1. Schematic examples of auditory and visual unimodal stimuli. Panel A: exemplars of visual stimuli; Panel B: exemplars of auditory stimuli. In each panel, two examples are shown for the N condition (the last four items in an eight-item sequence are uncorrelated with the first four) and for the RN condition (the last four items in an eight-item sequence repeat identically the first four).

with a resolution of 1024×768 pixels and a refresh rate of 75 Hz. Display luminances were linearized by means of a calibration-based lookup table. Stimulus sequences were generated and presented by an Apple iMac computer, using Matlab (version 7.7) and extensions from the Psychophysics Toolbox (Brainard, 1997). Each visual sequence comprised eight luminance levels presented in rapid succession to the same 4.1°×4.1° (128×128 pixels) region at the display’s center. Each luminance level in an eight-item sequence was presented for 10 complete refreshes of the CRT screen (≈133 ms), which meant that a complete eight-item sequence played out in 1,067 ms. A viewing distance of 57 cm was enforced by means of a chin rest.

Auditory stimuli were seamless streams of eight equal-duration pure tones, each ≈133 ms in duration. These tones were sampled at 44.1 kHz and presented at 70-72 dB(A) through Sennheiser HD280 supra-aural earphones. To eliminate audible transients that would arise from abrupt changes in frequency from one tone to another, the leading and trailing edges of each tone were tapered with a raised cosine (≈1.13 msec rise or fall time).
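The tapering step can be illustrated with a minimal Python sketch. It is not the authors' Matlab code: the function name `tapered_tone`, the exact tone duration (ten refreshes at 75 Hz) and the example frequency of 372 Hz are assumptions chosen for illustration; only the sampling rate and the approximate ramp time come from the text.

```python
import math

SAMPLE_RATE = 44100      # Hz, sampling rate stated in the text
TONE_DUR = 10 / 75       # s, assumed: ten refreshes of a 75 Hz display (about 133 ms)
RAMP_DUR = 0.00113       # s, approximate rise/fall time stated in the text


def tapered_tone(freq_hz, dur=TONE_DUR, ramp=RAMP_DUR, fs=SAMPLE_RATE):
    """Return one pure tone whose onset and offset follow raised-cosine ramps."""
    n = round(dur * fs)
    n_ramp = max(1, round(ramp * fs))
    samples = []
    for i in range(n):
        gain = 1.0
        if i < n_ramp:
            # Rising half-cosine: gain goes smoothly from 0 toward 1
            gain = 0.5 * (1 - math.cos(math.pi * i / n_ramp))
        elif i >= n - n_ramp:
            # Falling half-cosine: mirror image of the onset ramp
            gain = 0.5 * (1 - math.cos(math.pi * (n - 1 - i) / n_ramp))
        samples.append(gain * math.sin(2 * math.pi * freq_hz * i / fs))
    return samples


tone = tapered_tone(372.0)   # hypothetical frequency inside the 344-400 Hz range
```

Because each tone starts and ends at zero amplitude, concatenating eight such tones yields a stream without abrupt amplitude steps at the tone boundaries.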

When the experiment’s design called for a multimodal stimulus, auditory and visual components of the stimulus sequence were presented synchronously. The synchronization of auditory and visual sequences was assessed using photodiode and microphone inputs to a dual-trace oscilloscope. Observations showed that the two streams were synchronized to within ±7 msec.

The stimulus-generation algorithm (see Figure 2) began by drawing eight random samples from a normal distribution, N(0, 0.2). Samples more than ±2 standard deviations from the mean were discarded and replaced. Together with the distribution’s relatively small standard deviation, censoring extreme values served to homogenize items that would appear in a stimulus sequence. This kept subjects from basing judgments on some highly-distinctive, “oddball” item or items. As a measure of how well this goal was met, successive samples in a sequence differed by a maximum of 1.57, while 10% of successive samples differed by 0.07 or less, and 50% of successive values differed by 0.37 or less.
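The sampling step can be sketched in Python as follows. This is an illustration, not the authors' implementation. One assumption deserves emphasis: the second parameter of N(0, 0.2) is treated here as a variance (standard deviation of about 0.447), because a standard deviation of 0.2 with a ±2 SD cutoff could not produce the reported maximum successive difference of 1.57. The function name `draw_samples` is hypothetical.

```python
import random


def draw_samples(n=8, variance=0.2, cutoff_sd=2.0, rng=random):
    """Draw n values from a zero-mean Gaussian, redrawing any value
    that lies more than cutoff_sd standard deviations from the mean."""
    sd = variance ** 0.5           # assumption: 0.2 is the variance
    samples = []
    while len(samples) < n:
        x = rng.gauss(0.0, sd)
        if abs(x) <= cutoff_sd * sd:
            samples.append(x)      # keep only values inside the cutoff
    return samples


rng = random.Random(1)             # fixed seed so the example is repeatable
seq = draw_samples(rng=rng)
# Absolute differences between successive items, the quantity summarized in the text
steps = [abs(b - a) for a, b in zip(seq, seq[1:])]
```

Under this assumption no two successive items can differ by more than four standard deviations (about 1.79), which is compatible with the reported maximum of 1.57.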

To determine what luminances would be presented, the eight samples drawn for the trial were translated into equivalent luminance contrasts. Contrast was defined as $(L_{pix} - L_{bg})/L_{bg}$, where $L_{pix}$ is the luminance of a stimulus pixel, and $L_{bg}$ is the display’s background luminance, which was held constant at 19.03 cd/m². The resulting luminances ranged from 2 cd/m² to 42 cd/m². When a stimulus sequence included an auditory component, the eight luminances in the sequence were translated into equivalent pure tones whose frequencies were a linear function of luminance.

Experiment

Valid comparisons between unimodal and multimodal conditions demand that baseline performance with Auditory and Visual sequences be equated. Were the separate unimodal contributions to a multimodal sequence substantially unequal, performance with a multimodal sequence would be dominated by the more potent of the two unimodal drivers. As a first step toward equating performance with Visual and Auditory sequences, we turned to existing brightness-pitch cross-modal matching results reported by Marks (1974). Marks asked subjects to adjust the pitch of a tone to perceptually match the brightness of various achromatic Munsell patches. As we were committed to using the same luminance range that Gold et al. (2013) had used, these cross-modal matching results dictated that we use a set of frequencies spanning just under two and a half octaves, 100 to 555 Hz (≈A♭2 to ≈C♯5 on an equal-tempered musical scale).

A preliminary experiment tested 12 non-musicians with Visual stimuli generated as in Gold et al. and with Auditory sequences drawn from the frequency range implied by Marks’s cross-modal matching result, that is, 100–555 Hz. Although performance with Visual stimuli nicely replicated what had been found previously by Gold et al. with the same sample size, the Auditory sequences produced d′ values considerably higher. In particular, subjects’ ability to recognize a within-sequence repetition of items was far better with Auditory sequences than with Visual sequences, mean d′ values of 2.49 (SEM = 0.15) and 1.23 (SEM = 0.14), respectively (p < .01). This substantial mismatch between Auditory and Visual performance probably reflects the difference between conditions used for cross-modal matching and the conditions confronting our subjects. For example, Marks’s (1974) subjects matched individual Auditory and Visual items under self-paced viewing and listening times, and did so under conditions that put no burden on subjects’ memory. In contrast, our task not only imposed a considerable burden on subjects’ memory, but, as importantly, presented items in succession at a high rate (8 Hz). Welch and Warren’s (1980) modality-appropriateness conjecture suggests that our task’s emphasis on temporal attributes of a sequence would advantage auditory processing over visual processing, just as our preliminary experiment showed. Moreover, the large difference between d′ values for Auditory and Visual sequences is consistent with the idea that perceptual encoding of pitch sequences may be aided by special-purpose neural mechanisms responsive to frequency shifts (Cousineau, Demany, & Pressnitzer, 2009).

Whatever its cause, the approximately twofold difference in d′ values in our preliminary experiment suggests that if the unimodal stimuli from that experiment were combined in multisensory sequences, performance would be dominated by the sequences’ Auditory components, rendering valid assessment of multisensory integration difficult. To avoid that likelihood while retaining the luminance range that Gold et al. used, we narrowed the range of auditory frequencies that would be used in the experiment proper. Specifically, the tones comprising auditory sequences were drawn from a range of 344 to 400 Hz. In musical terms, this reduced range of tones went from slightly below F4 to slightly above G4.
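The chain from sampled contrast to luminance to tone frequency can be sketched in Python. The contrast definition, background luminance, luminance range and frequency range are taken from the text; the assumption that the ends of the luminance range map exactly onto the ends of the frequency range is ours, as are the function names.

```python
L_BG = 19.03                  # cd/m^2, background luminance from the text
LUM_RANGE = (2.0, 42.0)       # cd/m^2, luminance range reported in the text
FREQ_RANGE = (344.0, 400.0)   # Hz, narrowed frequency range from the text


def contrast_to_luminance(contrast, l_bg=L_BG):
    """Invert contrast = (L_pix - L_bg) / L_bg to recover pixel luminance."""
    return l_bg * (1.0 + contrast)


def luminance_to_frequency(lum, lum_range=LUM_RANGE, freq_range=FREQ_RANGE):
    """Linear luminance-to-frequency map.

    Assumption: the lowest luminance maps to the lowest frequency and the
    highest luminance to the highest frequency; the text states only that
    frequency was a linear function of luminance.
    """
    (l0, l1), (f0, f1) = lum_range, freq_range
    return f0 + (lum - l0) * (f1 - f0) / (l1 - l0)


# A zero-contrast item sits at the background luminance
background_freq = luminance_to_frequency(contrast_to_luminance(0.0))
```

With these assumed endpoints, each 1 cd/m² step in luminance corresponds to a 1.4 Hz step in frequency.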

[Figure 2 flowchart not reproduced. Steps: draw 8 random samples from Gaussian; translate to luminance values; branch on Unimodal (Visual, or Auditory: translate to frequencies and discard luminances) versus Multimodal (AVcongruent: translate luminances to frequencies and retain both A and V; AVincongruent: generate new random sequence for audio and retain both A and V); if the stimulus is a Repeat, replace last 4 items with copies of first 4; present stimulus.]

Figure 2. Flowchart for stimulus generation. The steps in the stimulus generation algorithm are explained in the text.

Within each block, stimulus sequences comprised two different structural categories. In some sequences, hereafter termed “Repeat” (RN) sequences, the last four items in the sequence repeated the first four items identically and in order; all items were generated anew for each trial. In other sequences, hereafter called “Noise” (N) sequences, each of the eight items was the product of an independent sample (see Figure 2); these sequences, too, were generated anew for each trial. In each block of trials, Repeat and Noise sequences were randomly intermingled, with both trial types occurring equally often.

With unimodal stimuli, subjects attempted to identify whether halves of an eight-item

stimulus sequence repeated or not. Unimodal stimuli, Visual or Auditory, were presented in

separate blocks of 75 trials each.

Figure 3 presents schematic examples of both classes of multimodal stimuli, AVcon-

gruent and AVincongruent. With multimodal sequences, subjects were instructed to ignore

a sequence’s auditory dimension, and to base judgments solely on variation in the visual

dimension. In order to probe limits on the ability to ignore concurrent Auditory signals,

we devised two classes of multimodal sequences: Congruent sequences, in which variation

in frequency was a monotone function of the accompanying luminance, and Incongruent

sequences, in which variation in frequency was uncorrelated with variation in luminance.

Each sequence comprised eight items presented in rapid succession, at 8 Hz. As it has long

been known that concurrent co-modulation promotes integration or binding of auditory

and visual signals (Thomas, 1941), we hypothesized further that co-modulation would help

subjects to recognize when items in a sequence were repeated.

In order to generate multisensory, Audiovisual stimuli whose Visual and Auditory com-

ponents were incongruent, the stimulus-generation algorithm (Figure 2) drew a second,

“dummy” set of eight luminance samples from the zero-mean Gaussian. The tonal equiva-

lents to members of this new set were derived and substituted for the tonal equivalents to

the luminances already selected for that trial. The result was a set of frequencies that were

uncorrelated with the set of luminances. To produce a Repeat (RN) sequence we replaced

the sequence’s last four items –whether unimodal or multimodal– with exact copies of the

first four items. With this last step, the algorithm could generate any of the stimulus types

that the experiment required.
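The branching described in this paragraph and in Figure 2 can be sketched in Python. The sketch works on the underlying Gaussian samples, before conversion to luminance or frequency. The names `gaussian_items` and `make_trial` are hypothetical, the variance reading of N(0, 0.2) is an assumption, and so is the choice to repeat the auditory stream along with the visual stream on incongruent Repeat trials (our reading of "whether unimodal or multimodal").

```python
import random


def gaussian_items(n, rng, sd=0.2 ** 0.5, cutoff=2.0):
    """Zero-mean Gaussian samples, redrawn when beyond cutoff SDs (assumed SD)."""
    items = []
    while len(items) < n:
        x = rng.gauss(0.0, sd)
        if abs(x) <= cutoff * sd:
            items.append(x)
    return items


def make_trial(kind, repeat, rng):
    """Build one trial. kind is 'V', 'A', 'AVcon' or 'AVincon'.

    Returns (visual, auditory); a stream that is not presented is None.
    """
    visual = gaussian_items(8, rng)
    if kind == "AVincon":
        auditory = gaussian_items(8, rng)   # second, "dummy" draw: uncorrelated
    else:
        auditory = list(visual)             # tones track the visual samples
    if repeat:
        # Repeat (RN): last four items become exact copies of the first four
        visual[4:] = visual[:4]
        auditory[4:] = auditory[:4]
    if kind == "V":
        auditory = None
    elif kind == "A":
        visual = None                       # luminances discarded
    return visual, auditory


rng = random.Random(0)
v_con, a_con = make_trial("AVcon", repeat=True, rng=rng)
v_inc, a_inc = make_trial("AVincon", repeat=True, rng=rng)
v_only, a_none = make_trial("V", repeat=False, rng=rng)
```

Keeping the Repeat step last mirrors the order given in the text, so one routine covers every stimulus type.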

Audiovisual stimuli were presented in blocks of 150 trials divided approximately equally between two sequence types. For AVcongruent sequences, an Auditory item’s frequency


[Figure 3 image not reproduced. Panel A: Sample AV Congruent Trials (types RNcon and Ncon); Panel B: Sample AV Incongruent Trials (types RNincon and Nincon). Each row shows one trial and another trial.]

Figure 3. Schematic examples of multimodal stimuli. Panel A: exemplars of stimuli whose audio and visual components were congruent; Panel B: exemplars of stimuli whose audio and visual components were incongruent, that is, uncorrelated. In each panel, two examples are shown for the N condition (the last four items in an eight-item sequence are uncorrelated with the first four) and for the RN condition (the last four items in an eight-item sequence repeat identically the first four).


was an increasing linear function of the luminance of the accompanying Visual item; for AVincongruent sequences, component luminances were uncorrelated with the frequency of accompanying Auditory components. AVcongruent and AVincongruent sequences were randomly intermingled within a block of trials, and equal numbers of RN and N sequences were randomly presented for each type. With all Audiovisual stimuli, subjects were instructed to ignore the Auditory component, and base their judgments solely on the Visual aspect of the sequence.

Three hundred milliseconds after a stimulus sequence ended, a message on the screen prompted the subject for a key press that signaled whether elements in the sequence repeated. Feedback, in the form of a text message, followed. Subjects were encouraged to rest after every 50 trials, but were asked to remain seated throughout the experiment. The order in which Auditory, Visual and Audiovisual trial blocks were presented was counterbalanced across subjects. Before beginning the experiment, subjects practiced with 20 trials of each stimulus type: Auditory, Visual, AVcongruent and AVincongruent.

We tested equal numbers of subjects who had had music training and ones who had not. Previous results on processing of temporal sequences led us to hypothesize that musicians would excel not only with auditory sequences, but with rapidly-presented visual sequences as well (Deliège, Mélen, Stammers, & Cross, 1996; Deliège, 1996). A questionnaire about music training was used to recruit and assign subjects to two groups, one whose members had experienced extensive musical training, and another whose members had little or no training. Following Skoe and Kraus (2012), a subject qualified as a “musician” if he or she had played one or more musical instruments for six or more years, and was continuing to play/practice an instrument up to the time of the experiment. A “non-musician” was defined as someone who either had never played a musical instrument or had played a musical instrument for three or fewer years, more than six years before study participation.1 The musicians on average had 10.93 years of musical training.

1 We recognize that merely having played an instrument for some time does not truly make someone a musician, at least as that term is usually used. However, the terms “musician” and “non-musician” can serve as convenient, if imperfect, surrogates.


Fourteen musicians and fourteen non-musicians, all between 18 and 22 years of age, participated in this experiment. These subject samples were comparable in size to ones previously tested by Gold et al. (2013) on the same task. Each subject was compensated $10 for participation. Nine subjects in each group were female. Table 1 summarizes the history of musical training reported by subjects who qualified as Musicians. All subjects had normal hearing and best-corrected Snellen acuity of at least 20/40. Hearing was indexed by a subject’s pure tone average (PTA; the average threshold in each participant’s better ear for 1, 2, and 4 kHz). All subjects’ PTAs, as measured with a Beltone 120 audiometer, were ≤25 dB(HL), which qualifies as clinically-normal hearing (Mueller & Hall, 1998). The experimental protocol was approved by Brandeis University’s Committee for the Protection of Human Subjects.

Table 1. Gender and age at which musical training began, years of musical training and instrument(s) played by musically trained subjects. For subjects who reported playing multiple instruments, instruments are listed in order of earliest learned.

Subject  Gender  Starting Age (years)  Musical Training (years)  Instrument(s)
1        F       10                    12                        Violin, Piano
2        M       4                     15                        Cello, Violin, Guitar
3        F       5                     15                        Violin
4        F       7                     12                        Piano, Clarinet, Bass Clarinet
5        F       4.5                   15                        Piano, Drums
6        M       6                     15                        Piano
7        F       7                     12                        Piano, Flute, Saxophone
8        F       7                     9                         Piano
9        M       5                     9                         Piano, Flute, Guitar
10       M       10                    8                         Saxophone, Bass, and Guitar
11       M       12                    6                         Guitar, Piano
12       F       9                     6                         Flute, Piano
13       F       7                     8                         Saxophone, Violin
14       F       9                     11                        Piano, Flute, Guitar


Results

Performance with various stimulus types was measured by each subject’s d′ values. Figure 4 shows musically-trained and non-trained subjects’ mean d′ values for Auditory, Visual, AVcongruent and AVincongruent trials. These results were analyzed with separate ANOVAs on results from unimodal and multimodal stimuli. The ANOVAs were followed by t-tests, when needed.
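The text does not spell out how d′ was computed from the responses. A common approach, sketched here in Python as an assumption rather than as the authors' procedure, takes the difference between the z-transformed hit rate ("repeating" responses on Repeat trials) and false-alarm rate ("repeating" responses on Noise trials). The log-linear correction that adds 0.5 to each cell is likewise an illustrative choice; it keeps the rates away from 0 and 1.

```python
from statistics import NormalDist


def d_prime(hits, misses, false_alarms, correct_rejections):
    """Sensitivity index d' = z(hit rate) - z(false-alarm rate).

    Assumption: log-linear correction (add 0.5 to each count) so that
    perfect or empty cells do not produce infinite z-scores.
    """
    hit_rate = (hits + 0.5) / (hits + misses + 1.0)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1.0)
    z = NormalDist().inv_cdf        # inverse of the standard normal CDF
    return z(hit_rate) - z(fa_rate)


# Hypothetical counts for illustration only
chance = d_prime(20, 20, 20, 20)     # equal hit and false-alarm rates
better = d_prime(30, 10, 10, 30)     # more hits than false alarms
```

Equal hit and false-alarm rates give d′ = 0, and d′ grows as hits outnumber false alarms.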

First, the two types of unimodal stimuli, Auditory and Visual, proved to be equally challenging for subjects (F(1,26) = .01, p = 0.92, η² = .023). Thus, the goal of equating the two types of unimodal stimuli was achieved. Second, Musicians outperformed Non-musicians in distinguishing unimodal random from unimodal repeating sequences (F(1,26) = 8.24, p = 0.01, η² = .241). Finally, Musicians outperformed Non-musicians with both Auditory (t(23) = 2.47, p = .02) and Visual (t(23) = 2.03, p = .05) sequences.

Turning to Audiovisual stimuli, an ANOVA showed no significant overall difference between Musicians and Non-musicians (F(1,26) = 2.66, p = 0.11, η² = .095), although congruency between Auditory and Visual components did matter: performance was significantly better with AVcongruent sequences than with AVincongruent sequences (F(1,26) = 182.99, p < .00001, η² = .859). t-tests comparing the performance of Musicians and Non-musicians for AVcongruent and AVincongruent trials revealed a significant difference on AVcongruent trials (t(26) = 2.21, p = .04), but not on AVincongruent trials (t(26) = .59, p = .56).

The pattern of results shown in Figure 4 led us to ask whether Musicians’ advantage over Non-musicians with Visual stimuli was exaggerated with stimuli that included Auditory components. For an answer, we computed two sets of difference scores, the first by subtracting a subject’s d′ value for Visual sequences from that subject’s d′ value for Auditory sequences, and the second by subtracting a subject’s d′ value for Visual sequences from that subject’s d′ value for AVcongruent sequences. Independent-samples t-tests compared Musicians and Non-musicians on each of these sets of difference scores. The advantage that Musicians showed with Visual stimuli was not significantly enlarged either with Auditory or AVcongruent stimuli (p = 0.08 and p = 0.36, respectively; df = 23, one-tailed t-tests).


[Figure 4 bar chart not reproduced. x-axis: stimulus type (A, V, AVcon, AVincon); y-axis: d′ (0 to 3.5); legend: Musician, Non-Musician.]

Figure 4. Mean d′ values. For each condition, the mean d′ value for musicians (lighter bar) is stacked atop the corresponding value for non-musicians (darker bar). Error bars represent ±1 within-subject standard error.

Figure 5 highlights how history of musical training affects short-term memory for sequences of various kinds. Years of music training and d′ were significantly correlated for Auditory sequences (r(26) = .60, p < .05; Panel B) and for AVcongruent sequences (r(26) = .61, p < .05; Panel C). For Visual sequences, though, the relationship between years of training and performance failed to reach significance (r(26) = .47, p = .10; Panel A), as did the result for AVincongruent sequences (r(26) = .08, p = .80; Panel D). So, with some, but not all, kinds of stimulus sequences, performance is significantly correlated with years of music training.


[Figure 5 plots not reproduced. Four scatter plots of d′ (0 to 3.5) against years of musical training (0 to 16), with separate symbols for Musicians and Non-musicians. Panel A: Visual Only; Panel B: Auditory Only; Panel C: A-V Congruent; Panel D: A-V Incongruent.]

Figure 5. Values of d′ as a function of years of musical training. Results for unimodal sequences are shown in Panels A and B; results for multimodal sequences are shown in Panels C and D.


Reverse correlation

To determine whether performance differences between Musicians and Non-Musicians were associated with differences in subjects’ strategies, we turned to reverse correlation analysis (Ahumada & Beard, 1997; Murray, Bennett, & Sekuler, 2002). This analytic technique computes the correlation between subjects’ responses across trials and the contrast of each item in a sequence. The result is a set of weights that shows the relative influence exerted by each item on subjects’ decisions. Recall that subjects were instructed to judge whether the last four items in a sequence did or did not identically repeat the first four items. Previously, applying this analysis to results with Visual sequences in this same task, Gold et al. (2013) discovered that subjects placed preferential emphasis on the luminance of certain sequence items. In particular, reverse correlation revealed that subjects gave particular weight to the final items in each half of an eight-item Visual sequence. To see how adding correlated and uncorrelated auditory sequences affected strategy, we performed the same analysis on Visual, AVcongruent and AVincongruent data. Specifically, vectors containing the eight contrast values displayed on a trial were sorted into four possible stimulus-response combinations. The vectors were then averaged for each stimulus-response combination, and Equation 1 was used to produce the mean kernel $\vec{c}$:

$$\vec{c} = (rR + rN) - (nN + nR), \tag{1}$$

where xY denotes the combination of response x (either “repeating” or “not repeating”)

and stimulus Y (either Repeat or Noise).

The result, $\vec{c}$, is an eight-element vector whose values are the relative weights assigned to the items in a sequence that was being judged. If an observer’s classification of a stimulus were uncorrelated with the contrast value at a particular ordinal position of a sequence, the resulting mean kernel for that ordinal position would not significantly differ from zero. A positive weight in the mean kernel indicates that positive contrast values at that position promoted “repeating” responses, while negative contrast values promoted “non-repeating” responses. Likewise, a negative weight in the mean kernel indicates that positive contrast values promoted “non-repeating” responses, while negative contrast values promoted “repeating” responses.
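The computation in Equation 1 can be sketched in Python. The function names, the trial representation and the treatment of an empty stimulus-response bin as a zero vector are assumptions made for illustration; the sorting, averaging and combination steps follow the description in the text.

```python
def mean_vector(vectors, n_items=8):
    """Element-wise mean of a list of contrast vectors (zeros if the list is empty)."""
    if not vectors:
        return [0.0] * n_items      # assumption: an empty bin contributes nothing
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(n_items)]


def classification_kernel(trials, n_items=8):
    """Mean kernel c = (rR + rN) - (nN + nR).

    trials: iterable of (contrasts, stimulus, response), where stimulus is
    'R' (Repeat) or 'N' (Noise) and response is 'r' ("repeating") or
    'n' ("not repeating").
    """
    bins = {"rR": [], "rN": [], "nR": [], "nN": []}
    for contrasts, stimulus, response in trials:
        bins[response + stimulus].append(contrasts)   # sort by response and stimulus
    m = {key: mean_vector(vecs, n_items) for key, vecs in bins.items()}
    return [m["rR"][i] + m["rN"][i] - m["nN"][i] - m["nR"][i]
            for i in range(n_items)]


# Toy data: a hypothetical observer whose response depends only on item 4
toy_trials = [
    ([0, 0, 0, 1, 0, 0, 0, 0], "N", "r"),
    ([0, 0, 0, -1, 0, 0, 0, 0], "N", "n"),
]
kernel = classification_kernel(toy_trials)
```

In the toy data only the fourth element of the kernel is nonzero, which is the signature of an observer who bases decisions on that item alone.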

Figure 6 shows the results of this analysis for Visual only (Panel A), AVcongruent (Panel B) and AVincongruent (Panel C) stimuli. The filled symbols (•) represent Musicians’ results, and the open symbols (o) represent Non-Musicians’ results. First, consider Visual-only sequences. These reverse correlation functions strongly resemble ones reported previously for Visual sequences (Gold et al., 2013). In particular, the fourth and eighth items in an eight-item sequence have the strongest influence on subjects’ judgments. Moreover, both Musicians and Non-musicians exhibited this pattern. Thus, differences in d′ values between Musicians and Non-Musicians with Visual-only stimuli probably did not reflect differences in overall strategy, but resulted from some other aspect of how each group processed stimulus information, for example, levels of internal noise (Burgess, Wagner, Jennings, & Barlow, 1981) or uncertainty (Pelli, 1985).

Next, consider results with AVcongruent stimuli. Unlike what was seen with Visual stimuli, here there is a marked difference between Musicians’ and Non-musicians’ strategies. In particular, Musicians appear to have maintained the same strategy that they used when no Auditory stream was present; in contrast, Non-musicians show no consistent preferential weighting for particular items within the eight-item sequence. Gold et al. (2013) showed that their subjects weighted a sequence’s fourth and eighth items mainly in order to deal with intrinsic uncertainty about the temporal boundaries of the visual sequences they were seeing. (Estimates of these boundaries would obviously be an important element in comparing a sequence’s two halves.) That interpretation predicts that subjects’ performance would be reduced if they failed to use such a strategy, which is exactly what we found. Thus, it appears that the presence of a correlated Auditory stream interferes with the ability of Non-musicians to maintain the strategy that they might otherwise use to overcome the limiting effects of temporal uncertainty. Musicians, on the other hand, appear to be much less affected by the concurrent, correlated Auditory stream.


[Figure 6 plots not reproduced. Three panels plot relative weight (-0.20 to 0.20) against item in sequence (1 to 8), with separate curves for Musicians and Non-musicians. Panel A: Visual Only; Panel B: A-V Congruent; Panel C: A-V Incongruent.]

Figure 6. Reverse correlations based on sequences’ Visual attributes. In each panel, separate curves are shown for Musicians (•) and Non-musicians (o). Panel A shows the reverse correlations for Visual unimodal sequences; Panel B shows the reverse correlations for AVcongruent sequences; and Panel C shows the reverse correlations for AVincongruent sequences.


Finally, consider results produced with AVincongruent sequences (Panel C). Here, both Musicians and Non-musicians seem to have failed to maintain the strategy they adopted with Visual-only stimuli. Further, recall that with AVincongruent stimuli there was no significant difference between Musicians’ and Non-musicians’ d′ values. Apparently, the inclusion of an uncorrelated Auditory stream undermines subjects’ ability to selectively weight key items in a sequence.

Discussion

Our results demonstrated that Musicians have enhanced ability to detect repetitions

of items within rapidly-presented sequences, and that this enhanced ability extends not

only to Auditory sequences, but also to Visual and correlated Audiovisual sequences.

This result may reflect the effects of training to play an instrument, especially given that

performance does correlate with years of training (Figure 5B and C). However, our results

should not encourage non-musically trained readers to rush headlong into taking up a mu-

sical instrument. The mere fact that Musicians outperform Non-musicians on our task does

not mean that music training per se is responsible (see Morrison & Chein, 2011). After all,

a person who, absent training, would have excelled on the task anyway might have been

more inclined to start such training. Further, pre-existing talent for processing Auditory

sequences might encourage a person not only to initiate, but also to persist in learning to

play an instrument.

A proper test of causal linkage between music training and performance on a task

like ours requires that researchers begin with subjects who had never had training to play

an instrument. Some of those naive subjects would be randomly assigned to undergo in-

strument training for an extended period, while control subjects receive some equivalent

non-instrument training for the same period (Barrett, Ashley, Strait, & Kraus, 2013).

Ideally, the effect of differential training would be assessed not only via differential behavioral

changes, but also in terms of correlated changes in the brain. We know of just one study that

meets these stringent criteria. In that study, Hyde et al. (2009) constituted two groups of


children who were a bit over six years old at the start of the 15-month study. One group re-

ceived weekly, 30-minute private keyboard lessons during the study; a second group received

no instrumental music training, but instead participated in weekly 40-minute group music

classes, during which they sang and played with drums and bells. Comparisons of pre- and

post-treatment measures showed that instrumental training differentially enhanced ability

to distinguish between pairs of five-tone musical phrases that differed either in melody, that

is, in pitch sequence, or in rhythm (Overy et al., 2004). Moreover, analysis of magnetic

resonance images captured at the start and end of the study revealed that training-induced

improvements on the melodic/rhythmic discrimination test were correlated with deformation

changes in a key auditory area of subjects’ right hemispheres, that is, the lateral aspect of

Heschl’s gyrus.

The criteria we and others used to distinguish Musicians from Non-musicians have

obvious limitations. In fact, the challenge of precisely defining what constitutes a “musician”

is well-known (Levitin, 2012). Following the lead of Skoe and Kraus (2012), we defined

musician status mainly by subjects’ self-reports of how long they had played an instrument.

Of course, not every person who receives many years of music instruction or engages in years

of continuous practice achieves a level of proficiency that would satisfy common definitions of

“musician.” Conversely, a person might possess sufficient native talent that he or she could

achieve very high proficiency in very short order. Additionally, there likely are multiple

differences between our two groups, including differences in auditory imagery (Brown &

Palmer, 2013; Keller, Dalla Bella, & Koch, 2010), perceptual grouping (Kung, Tzeng, Hung,

& Wu, 2011), and other, general cognitive factors as well. By assessing multiple dimensions

of auditory experience and other cognitive attributes, future studies could contribute toward

a more complete definition of “musicianship.”

Of course, playing a musical instrument does not require processing quasi-random

luminance sequences like the ones in our Visual or Audiovisual stimuli. However, playing

an instrument could entail the translation of spatial information into temporal sequences,

as one does, for example, when reading music. This spatio-temporal translation might be


expedited by the spontaneous, natural mapping of pitch onto the visual feature of vertical

location (Evans & Treisman, 2010). Mindful of the role that spatial information might

play for musicians, Bergstrom et al. (2012) measured the speed and accuracy with which subjects

learned to make key presses to each of a series of targets presented at different locations on a

computer screen. Embedded in the sequence of events was a sub-sequence in which events’

locations were governed by the rules of an artificial grammar (e.g., Reber & Millward,

1968). Using much the same definition of “musician” that we did, Bergstrom et al. found

that Musicians’ implicit learning of sequential regularities was better than that of Non-

musicians. This result suggests that among the skills on which musicians excel is the skill

of implicitly learning and remembering quasi-random spatio-temporal sequences.

Figure 4 showed that even though subjects were instructed to focus exclusively on

the Visual aspect of an Audiovisual sequence, the presence of a concurrent, congruent Audi-

tory sequence boosted performance considerably over what was seen with either unimodal

sequence. Although our data do not support formal model selection, we can compare this

Audiovisual effect against what would be expected from one simple, widely-used benchmark.

Imagine that two orthogonal signals were processed by independent mechanisms, A and V,

whose noise was uncorrelated. Under such conditions, with each sensitivity expressed as

$d'$, the response to the combination of the two signals would be

$\sqrt{(d'_V)^2 + (d'_A)^2}$ (Green & Swets, 1966; Green, 1958; Viemeister & Wakefield, 1991).

A t-test confirmed that AVcongruent sequences boosted performance to a level well above

the predicted value (t(23) = -3.46, p < 0.001).

For the sake of completeness, we also compared performance with AVincongruent

sequences against performance with each type of unimodal sequence. Neither comparison

was statistically significant (p=.51 and .33, for t-tests against Auditory and Visual sequences,

respectively). Returning to the surprisingly powerful advantage seen with AVcongruent se-

quences, it should be noted that the super-additivity of Auditory and Visual components in

such sequences was produced despite the fact that those separate unimodal aspects were

strongly correlated, that is, distinctly non-orthogonal. As this surprising result may be

valuable in informing theories of multisensory integration, the boundary conditions on this


result demand further study. For example, it may be that this apparent super-additivity reflects

the engagement of mechanisms specialized for multisensory coincidence or congruence (e.g.,

Bushara et al., 2003; Kayser, Logothetis, & Panzeri, 2010; Orchard-Mills et al., 2013).
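The quadratic-summation benchmark described above can be made concrete with a short Python sketch. It computes the predicted combined sensitivity for each subject from the two unimodal d′ values and then compares observed against predicted values with a paired t statistic. The per-subject numbers are hypothetical placeholders, not data from this study, and the function names are illustrative assumptions.

```python
import math


def predicted_dprime(d_visual, d_auditory):
    """Benchmark sensitivity for two independent channels with uncorrelated noise."""
    return math.hypot(d_visual, d_auditory)


def paired_t(observed, predicted):
    """Return (t, df) for the mean of the observed-minus-predicted differences."""
    diffs = [o - p for o, p in zip(observed, predicted)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two paired values")
    mean_diff = sum(diffs) / n
    variance = sum((d - mean_diff) ** 2 for d in diffs) / (n - 1)
    if variance == 0.0:
        raise ValueError("differences have zero variance; t is undefined")
    return mean_diff / math.sqrt(variance / n), n - 1


# Hypothetical per-subject d' values, for illustration only (not data from this study).
d_visual = [0.9, 1.1, 0.7, 1.3]
d_auditory = [1.0, 0.8, 1.2, 0.9]
d_congruent = [2.1, 1.9, 2.4, 2.0]

predicted = [predicted_dprime(v, a) for v, a in zip(d_visual, d_auditory)]
t_value, df = paired_t(d_congruent, predicted)
print(f"t({df}) = {t_value:.2f}")
```

With this subtraction order, a positive t indicates that observed performance exceeds the benchmark; subtracting in the opposite order flips the sign without changing the magnitude.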

We should note that one recent study did not find differences in Musicians’ and

Non-musicians’ visual memory. Differences between that study and our own may be instructive,

so they are worth considering in some detail. For stimuli, Cohen, Evans, Horowitz, and

Wolfe (2011) chose pictures of objects, abstract art, speech clips and clips of familiar music.

Stimuli were presented one at a time, for five seconds each. After all stimuli of one class had

been presented, the researchers tested recognition memory by presenting intermixed old

(previously presented) and new (novel) stimuli, noting how well subjects correctly catego-

rized these intermixed stimuli as “old” or “new”. The key result for the present discussion is

that Musicians and Non-musicians did not differ in recognition memory for visual stimuli.

Of course, multiple task-related differences make it difficult to compare Cohen et al.’s results

to ours. These differences include (i) the types of stimuli used (low-level, elemental features

vs. higher-level stimuli, such as familiar tunes), (ii) the temporal characteristics of stimulus

presentation (rapid presentation of item sequences, which worked against online rehearsal,

vs. five seconds per individual item), and (iii) the task (recognizing within-trial repetitions

of items vs. longer-term recognition of single items). Although any or all of these differences

could account for the difference between Cohen et al.’s results and our own, it seems advisable

that when researchers want to assess musicians’ and non-musicians’ visual memory,

their choice of stimuli and test task should take account of the modality-appropriateness

conjecture (Welch & Warren, 1980).

In our study, Audiovisual congruence was defined by a positive monotone relationship

between an item’s luminance and the frequency of an accompanying tone. So defined,

Audiovisual congruence could be described as an all-or-none phenomenon: while components

of an AVcongruent sequence were perfectly correlated, components of an AVincongruent

sequence were on average completely unrelated. Of course, one could devise Audiovisual

sequences in which the correlation between Auditory and Visual components was neither


1.0 nor 0.0, but various values between. Such partially-correlated stimuli, like ones Agus

and Pressnitzer (2013) used to study memory for Auditory sequences, could be leveraged

to identify the strategies subjects drew on. It is worth noting that Audiovisual congruence

could take on forms other than the one we implemented (see review in Evans & Treisman,

2010). One such form would exploit the normal Audiovisual congruence that is characteristic

of speech production. Speech production requires movements of the mouth and face, which

produce a reliable correlation between the auditory output of the vocal tract, on one hand,

and visual motion cues, on the other. It has long been known that speech-related visual

cues alter the intelligibility and detectability of heard speech (e.g., Campbell, 2008). In fact,

the congruence between a speaker’s mouth and lip movements and the sound uttered by

the speaker is the basis of the well-known McGurk-MacDonald effect (McGurk & MacDonald, 1976), in which altering

the normal relationship between a spoken sound and the accompanying movements of the

mouth distorts a listener’s perception of that sound. Face-to-face speech can be described

as an inherently multisensory phenomenon (Chandrasekaran, Lemus, Trubanova, Gondan,

& Ghazanfar, 2011; Chandrasekaran, Lemus, & Ghazanfar, 2013). It is noteworthy that

this form of Audiovisual congruence extends to situations seemingly far removed from face-

to-face speech. In particular, this form of Audiovisual congruence was recently incorporated

into a first-person fisherman computer game (Sun, Shinn-Cunningham, Somers, & Sekuler,

2014; Bensussen et al., 2014). In that game, responses to computer-generated fish that

swam across a video display were considerably speeded when the amplitude modulation of

a sound emitted by a fish was correlated with the periodic fluctuations in the fish’s size.
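Returning to the partially-correlated Audiovisual sequences suggested earlier in this paragraph, one way to construct them can be sketched in Python as follows. The sketch assumes that item values are expressed as standardized (zero-mean, unit-variance) Gaussian deviates that would later be mapped onto luminance and tone frequency. That mapping, the function names, and the parameter `rho` are illustrative assumptions, not a description of the stimuli used here.

```python
import math
import random


def correlated_stream(visual, rho, rng):
    """Build an auditory stream with expected correlation `rho` to `visual`.

    Each auditory value mixes the visual value with independent Gaussian
    noise: a = rho * v + sqrt(1 - rho**2) * noise.
    """
    if not -1.0 <= rho <= 1.0:
        raise ValueError("rho must lie between -1 and 1")
    noise_scale = math.sqrt(1.0 - rho * rho)
    return [rho * v + noise_scale * rng.gauss(0.0, 1.0) for v in visual]


def pearson(x, y):
    """Sample Pearson correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(7)
visual = [rng.gauss(0.0, 1.0) for _ in range(5000)]
auditory = correlated_stream(visual, 0.5, rng)
print(round(pearson(visual, auditory), 2))
```

Setting `rho` to 1.0 reproduces the visual values exactly, and setting it to 0.0 yields an independent stream. A single eight-item sequence will show considerable sampling variability around the nominal correlation, so the nominal value describes the ensemble of sequences rather than any one trial.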

Finally, when presented with Audiovisual sequences, our subjects were instructed to

attend only to variations along the Visual dimension, ignoring the accompanying Auditory

dimension. Classification decisions, they were told, should be based only on the correspon-

dence between luminance sequences in a stimulus’ first and second halves. This instruction

to attend only to a sequence’s Visual dimension was reinforced by the fact that explicit

feedback after every response was contingent solely on the relationship between the subject’s

judgement and the stimulus’ Visual dimension. It is unknown how our subjects’ perfor-


mance would have been impacted had they been given the inverse instruction, that is,

to base decisions on Auditory signals, while ignoring Visual ones, although results from a

speeded-categorization task do suggest that with Audiovisual stimuli, leakage from a nomi-

nally unattended modality is bidirectional (Bensussen et al., 2014).

References

Agus, T. R., & Pressnitzer, D. (2013). The detection of repetitions in noise before and after

perceptual learning. Journal of the Acoustical Society of America, 134 (1), 464–473.

Agus, T. R., Thorpe, S. J., & Pressnitzer, D. (2010). Rapid formation of robust auditory

memories: insights from noise. Neuron, 66 (4), 610–618.

Ahumada, A. J., Jr, & Beard, B. L. (1997). Image discrimination models predict detection

in fixed but not random noise. Journal of the Optical Society of America. A, Optics,

image science, and vision, 14 (9), 2471–2476.

Barrett, K. C., Ashley, R., Strait, D. L., & Kraus, N. (2013). Art and science: how musical

training shapes the brain. Frontiers in Psychology, 4, 713.

Bensussen, S., Chou, K. F., Varghese, L., Sun, Y., Somers, D. C., Shinn-Cunningham,

B., & Sekuler, R. (2014). Bidirectional audiovisual interactions: Evidence from a

computerized fishing game. Providence, R I: Meeting of the Acoustical Society of

America.

Berger, C. C., & Ehrsson, H. H. (2013). Mental imagery changes multisensory perception.

Current Biology, 23 (14), 1367–1372.

Bergstrom, J. C. R., Howard, J. H., & Howard, D. V. (2012). Enhanced implicit sequence

learning in college-age video game players and musicians. Applied Cognitive Psychol-

ogy, 26, 91–96.

Brainard, D. H. (1997). The psychophysics toolbox. Spatial Vision, 10, 433-436.


Brown, R. M., & Palmer, C. (2013). Auditory and motor imagery modulate learning in

music performance. Frontiers in Human Neuroscience, 7, 320.

Burgess, A. E., Wagner, R. F., Jennings, R. J., & Barlow, H. B. (1981). Efficiency of human

visual signal discrimination. Science, 214 (4516), 93–94.

Bushara, K. O., Hanakawa, T., Immisch, I., Toma, K., Kansaku, K., & Hallett, M. (2003).

Neural correlates of cross-modal binding. Nature Neuroscience, 6 (2), 190–195.

Campbell, R. (2008). The processing of audio-visual speech: empirical and neural bases.

Philosophical Transactions of the Royal Society of London, Series B, Biological Sci-

ences, 363 (1493), 1001–1010.

Chan, A. S., Ho, Y. C., & Cheung, M. C. (1998). Music training improves verbal memory.

Nature, 396 (6707), 128.

Chandrasekaran, C., Lemus, L., & Ghazanfar, A. A. (2013). Dynamic faces speed up

the onset of auditory cortical spiking responses during vocal detection. Proceedings

of National Academy of Sciences of the United States of America, 110 (48), E4668–

E4677.

Chandrasekaran, C., Lemus, L., Trubanova, A., Gondan, M., & Ghazanfar, A. A. (2011).

Monkeys and humans share a common computation for face/voice integration. PLoS

Computational Biology, 7 (9), e1002165.

Chen, Y. C., & Spence, C. (2010). When hearing the bark helps to identify the dog:

Semantically-congruent sounds modulate the identification of masked pictures. Cog-

nition, 114 (3), 389-404.

Cohen, M. A., Evans, K. K., Horowitz, T. S., & Wolfe, J. M. (2011). Auditory and visual

memory in musicians and nonmusicians. Psychonomic Bulletin & Review, 18 (3),

586–591.


Cousineau, M., Demany, L., & Pressnitzer, D. (2009). What makes a melody: The percep-

tual singularity of pitch sequences. The Journal of the Acoustical Society of America,

126 (6), 3179–3187.

Deliège, I. (1996). Cue abstraction as a component of categorisation processes in music

listening. Psychology of Music, 24 (2), 131-156.

Deliège, I., Mélen, M., Stammers, D., & Cross, I. (1996). Musical schemata in real time

listening to a piece of music. Music Perception: An Interdisciplinary Journal, 14 (4),

117-160.

Evans, K. K., & Treisman, A. (2010). Natural cross-modal mappings between visual and

auditory features. Journal of Vision, 10 (1), 11507–11510.

Francois, C., & Schön, D. (2011). Musical expertise boosts implicit learning of both musical

and linguistic structures. Cerebral Cortex, 21 (10), 2357-65.

Gold, J. M., Aizenman, A., Bond, S. M., & Sekuler, R. (2013). Memory and incidental

learning for visual frozen noise sequences. Vision Research.

Green, D. M. (1958). Detection of multiple component signals in noise. The Journal of the

Acoustical Society of America, 30, 904–911.

Green, D. M., & Swets, J. A. (1966). Signal detection theory and psychophysics. New York:

Wiley.

Guttman, S., Gilroy, L. A., & Blake, R. (2005). Hearing what the eyes see: auditory

encoding of visual temporal sequences. Psychological Science, 16 (3), 228-35.

Hyde, K. L., Lerch, J., Norton, A., Forgeard, M., Winner, E., Evans, A. C., & Schlaug,

G. (2009). Musical training shapes structural brain development. Journal of Neuro-

science, 29 (10), 3019–3025.

Kahana, M. J., & Sekuler, R. (2002). Recognizing spatial patterns: a noisy exemplar

approach. Vision Research, 42 (18), 2177–2192.


Kayser, C., Logothetis, N. K., & Panzeri, S. (2010). Visual enhancement of the information

representation in auditory cortex. Current Biology, 20 (1), 19–24.

Keller, P. E., Dalla Bella, S., & Koch, I. (2010). Auditory imagery shapes movement timing

and kinematics: evidence from a musical task. Journal of Experimental Psychology:

Human Perception and Performance, 36 (2), 508–513.

Kraus, N., & Chandrasekaran, B. (2010). Music training for the development of auditory

skills. Nature Reviews Neuroscience, 11 (8), 599-605.

Kung, S.-J., Tzeng, O. J. L., Hung, D. L., & Wu, D. H. (2011). Dynamic allocation of

attention to metrical and grouping accents in rhythmic sequences. Experimental Brain

Research, 210 (2), 269–282.

Levitin, D. J. (2012). What does it mean to be musical? Neuron, 73 (4), 633-7.

Magnussen, S. (2000). Low-level memory processes in vision. Trends in Neuroscience,

23 (6), 247–251.

Marks, L. E. (1974). On associations of light and sound: The mediation of brightness, pitch

and loudness. The American Journal of Psychology, 87, 173-188.

McGurk, H., & MacDonald, J. (1976). Hearing lips and seeing voices. Nature, 264, 746–748.

Michalka, S., Rosen, M., Kong, L., Shinn-Cunningham, B., & Somers, D. (2012). fMRI

investigations of temporal sequence processing in visual short-term memory of humans.

(Poster presented at SFN2012 Conference)

Miller, M. B., & Gazzaniga, M. S. (1998). Creating false memories for visual scenes.

Neuropsychologia, 36 (6), 513–520.

Morrison, A. B., & Chein, J. M. (2011). Does working memory training work? The promise

and challenges of enhancing cognition by training working memory. Psychonomic

Bulletin & Review, 18 (1), 46–60.


Mueller, G., & Hall, J. W. (1998). Audiologist’s Desk Reference: Audiologic Management,

Rehabilitation and Terminology (Vol. II). Singular Publishing Group, Inc.

Murray, R. F., Bennett, P. J., & Sekuler, A. B. (2002). Optimal methods for calculating

classification images: weighted sums. Journal of Vision, 2 (1), 79–104.

Nisbett, R., & Wilson, T. D. (1977). Telling more than we can know: Verbal reports on

mental processes. Psychological Review, 84 (3), 231-259.

Orchard-Mills, E., Leung, J., Burr, D., Morrone, M. C., Wufong, E., Carlile, S., & Alais, D.

(2013). A mechanism for detecting coincidence of auditory and visual spatial signals.

Multisensory Research, 26 (4), 333–345.

Overy, K., Norton, A. C., Cronin, K. T., Gaab, N., Alsop, D. C., Winner, E., & Schlaug,

G. (2004). Imaging melody and rhythm processing in young children. NeuroReport,

15, 1723–1726.

Oxenham, A. J., Fligor, B. J., Mason, C. R., & Kidd, G., Jr. (2003). Informational masking

and musical training. The Journal of the Acoustical Society of America, 114 (3), 1543–

1549.

Pasternak, T., & Greenlee, M. W. (2005). Working memory in primate sensory systems.

Nature Reviews Neuroscience, 6 (2), 97–107.

Pelli, D. G. (1985). Uncertainty explains many aspects of visual contrast detection and

discrimination. Journal of the Optical Society of America A, 2 (9), 1508–1532.

Rammsayer, T., & Altenmüller, E. (2006). Temporal information processing in musicians

and nonmusicians. Music Perception: An Interdisciplinary Journal, 24, 37–47.

Reber, A. S., & Millward, R. B. (1968). Event observation in probability learning. Journal

of Experimental Psychology, 77 (2), 317–327.

Sekuler, R., Sekuler, A. B., & Lau, R. (1997). Sound alters visual motion perception.

Nature, 385 (6614), 308.


Skoe, E., & Kraus, N. (2012). A little goes a long way: How the adult brain is shaped by

musical training in childhood. The Journal of Neuroscience, 32 (34), 11507–11510.

Strait, D. L., Parbery-Clark, A., Hittner, E., & Kraus, N. (2012). Musical training during

early childhood enhances the neural encoding of speech in noise. Brain and language,

123 (3), 191-201.

Sun, Y., Shinn-Cunningham, B., Somers, D., & Sekuler, R. (2014). Multisensory learn-

ing and integration in a first-person fisherman game. Boston, MA: Meeting of the

Cognitive Neuroscience Society.

Thomas, G. (1941). Experimental study of the influence of vision on sound localization.

Journal of Experimental Psychology, 28, 167-177.

Viemeister, N. F., & Wakefield, G. H. (1991). Temporal integration and multiple looks.

The Journal of the Acoustical Society of America, 90 (2 Pt 1), 858–865.

Welch, R. B. (1999). Meaning, attention and the "unity assumption" in the intersensory bias

of spatial and temporal perceptions. In G. Aschersleben, T. Bachmann, & J. Müsseler

(Eds.), Cognitive contributions to the perception of spatial and temporal events. (pp.

371–387). Amsterdam: Elsevier.

Welch, R. B., & Warren, D. H. (1980). Immediate perceptual response to intersensory

discrepancy. Psychological Bulletin, 88 (3), 638–667.
