Vibrato and vowel identification - Royal Institute of ... · STL-QPSR 2-3/1975 Introduction Vibrato...
Transcript of Vibrato and vowel identification - Royal Institute of ... · STL-QPSR 2-3/1975 Introduction Vibrato...
Dept. for Speech, Music and Hearing
Quarterly Progress andStatus Report
Vibrato and vowelidentification
Sundberg, J.
journal: STL-QPSRvolume: 16number: 2-3year: 1975pages: 049-060
http://www.speech.kth.se/qpsr
STL-QPSR 2-3/1975
111. MUSICAL ACOUSTICS
A. VIBRATO AND VOWEL IDENTIFICATION
J. Sundberg
Abstract
The influence of vibrato on vowel identification i s studied in synthesized vowel sounds with fundamental frequencies between 300 and 1000 Hz. Phonetically trained subjects were asked to identify these stimuli presented with and without vibrato a s any of 12 Swedish long vowels. Small and occasional effects a r e ob- served on the mean formant frequencies and the scatter between the responses.
However, the effects a r e greater and more frequent$haai would be expected by chance. In a majority of the cases, where the vibrato affected the responses, the vowel identification became somewhat harder when the stimulus was presented with vibrato.
STL-QPSR 2-3/1975
Introduction
Vibrato occurs in music performed on many types of instruments,
such a s s t r ing and wind instruments and i n singing. Physically it cor - responds to an almost sinusoidal modulatio~t of the fundamental frequency.
Provided that the r a t e and extent of the vibrato i s kept within certain l im-
i t s , the vibrato i s generally considered a useful means of musical ex-
pression in a variety of musical styles. I
Acoustic character is t ics typical of music sounds can be assumed to
serve some purpose in musical communication. One commonly heard
hypothesis i s that the vibrato covers up small e r r o r s in the fundamental
frequency. In a previous investigation no support was found for this hypo-
thesis a s fa r a s the pitch of single tones - and thus not chords - i s con-
cerned (Sundberg 197 2). In the present investigation another commonly
heard hypothesis will be tested, namely that the vibrato facil i tates vowel
identification. This hypothesis seems plausible, because as the part ia ls
move in frequency (e . g. owing to a vibrato) their amplitudes scan small
p a r t s of the vocal-tract t ransfer function, and m o r e information on the , formant frequencies i s communicated. The effect would be particularly
strong in the soprano range where the fundamental frequency may be even
higher than 1000 Hz and the information on the formant frequencies would
be ha rd to decode f r o m the acoustic spectrum.
Experiment
A se r i e s of high-pitched sounds were synthesized with and without
vibrato using a terminal analogue. Six formant-frequency combinations
were used, each of which corresponds to a Swedish long vowel: [u , o,
a , e , i , y ] . The vowel formant frequencies ( s e e Table 111-A-I)
were typical of soprano singing according to previous measurements on
a singer singing a t a fundamental of 262 Hz (Sundberg 1975). Contrary to
what was found in this soprano, each of these s ix combinations of formant
frequencies were used without changes for four different pitches within
the soprano range: 300, 450, 675, and 1000 Hz. In this way the mater ia l
included sounds ranging f rom maximum ease to maximum difficulty a s
regards vowel identification. The vibrato consisted of a sinusoidal modu-
lation of the fundamental frequency. The r a t e was six ondulations per
second, and the extent was 50 cents. These values a r e found in normal
singing, even though the extent i s normally slightly narrower.
Table 111-A-I.
F o r m a n t f requency va lues used for syn thes i s of s t imul i ( lef t)
and a s c r i b e d t o r e s p o n s e vowels ( r ight ) .
I STIMULI RESPONSE VOWELS
Vowel
u
o
a a
a3
E
e
1
Y u
d ce
1 Hz
325
430
640
400
270
250
=2 H z
700
650
1090
2000
1900
1950
F3 H z
F4 H z
= 1 H z
I
2700 1 4000 320
415
700
855
815
625
375
275
275
300
390
565
2820
2800
2700
1950
2510
F2 H z
4000
4000
4000
4000
4000
Hz F3
565
725
1065
1200
2030
2305
2790
2675
2555
1920
2040
1290
2775
2775
3000
2820
3000
3040
3360
3655
3210
2745
2725
2690 J
STL-QFSR 2-3/1975 52.
Four samples of each vowel stimulus were tape recorded a t a constant
overall amplitude. The t ime and the shape of the onsets and decays were
a l l s imilar . The stimuli samples were arranged in random order avoid-
ing m o r e than one repetition of a given stimulus in sequence. The ent i re
tape contained a total of 192 vowel samples: six formant-frequency com-
binations x four fundamental frequencies x four presentations x two cases
(with and without vibrato). Each stimulus had a duration of I sec and was
followed by a silent interval of 4 sec.
The stimuli were presented in head-phones (Sennheiser HD 414) a t an
overall SPL of 70 dB, approximately, to ten phonetically trained subjects.
Their task was to decide which one of 12 given Swedish long vowels [ u, o,
a, a, ;e, E, e, i , y, u, 4 , c e ] sounded most s imilar to the stimulus they
heard and to write that vowel in phonetic symbols on a tes t chart. These
vowels suggested by the subjects a s interpretations of the stimuli will be
r e fe r red to a s response vowels. The test , including three short breaks,
took, i n all , about 30 min.
Evaluation of r e snons e s
I t seems reasonable to hypothesize that the process of identifying an I
isolated stimulus a s a specific vowel involves three major steps. The
f i r s t is a purely sensory one in which the acoustic stimulus i s converted
into some kind of sensory information. The second step involves an ex-
traction of an ensemble of sensory character is t ics f rom this sensory
information. In the third step a comparison i s made between these sen-
sory character is t ics and corresponding character is t ics of vowels s tored
in the subject ' s internal reference. The difficulty of identifying a sound
a s a specific vowel would depend on the degree of s imilar i ty between the
ensemble of sensory character is t ics and the character is t ics in the internal
reference.
The vibrato may affect the sensory information in various ways. Let
us assume that i t affects the extraction of sensory character is t ics . If so,
the responses for a given vibrato stimulus should differ f rom those ob-
tained when the same stimulus was presented without vibrato. In the tes t
responses the effect would then be that the mean formant frequencies of
the response vowels for a given stimulus would differ between the cases
with and without vibrato. In that case the identification difficulties may
STL-QPSR 2- 3/1975 53.
be affected i n various ways, depending on whether the vibrato makes the
sensory character is t ics (1) more s imi lar , (2) l e s s s imi lar , o r (3) equally
s imilar to the subject' s reference features. The cases (1) and (2) would
affect the scatter of the responses.
The above considerations show the need for measures of the average and
scat ter of the response vowels. In order to obtain such measures a set of
formant frequencies was ascr ibed to each of the 12 response vowels in ac-
cordance with data on female speakers , given by Fant (197 3). These values
a r e l is ted in Table 111-A-I. I t should be noted that the same formant f re -
quencies were ascr ibed to a given response vowel regard less of the funda-
mental fr equency of the corresponding stimulus. The formant frequencies
were expressed in .Mels and the mean formant frequencies of the response
vowels were computed for each stimulus.
Resul ts
The responses a r e l is ted in Table 111-A-11. One of the subjects refused
to give an answer to three of the stimuli, a l l of which happened to lack vib-
rato. The inter- subject differences rkgarding the internal reference were
ra ther small: three o r four unique votes on a specific response vowel were
given by the same subject only occasionally, a s can be seen in Table 111-A-II.
The mean formant frequencies of the response vowels a r e shown in Fig.
111-A- 1 and the corresponding standard deviations a r e given in Table 111-A-111.
The relationships between the mean formant frequencies of the r espon-
s e s and the stimulus spectra can be studied in Fig. 111-A-1, where, in
par t icular , the case of the stimuli synthesized with [I?]-formants offer an
i l lustrative example (Fig. 111-A- lc). The mean formant frequencies of the
responses drop with r is ing fundamental frequency up to 675 Hz. The r ea -
son for this is probably that, in o rde r to retain the vowel quality when the
fundamental is ra i sed , i t i s necessary to r a i s e the formant frequencies
slightly. Slawson (19 68) found an average increase of 10 b/o in the formant
frequencies to be required per octave r i s e of the fundamental. If the funda-
mental i s ra i sed while the formant frequencies a r e kept constant, a s was
done in our experiment, the vowel quality will shift towards a vowel with
lower formant frequencies. This effect can be seen i n the case of Fig.
111-A- 1 c. Presumably, the subjects identified the two lowest formant f re -
quencies correctly, and the formant frequencies ascr ibed to the response
vowels contain an e r r o r which i s dependent on the fundamental frequency.
A correction would be needed, but i t i s not known exactly how this cor rec-
tion should be made. F o r instance, Slawson's correction of 10 % in-
c rease in formant frequency pe r octave r i s e in the fundamental does not
FUNDAMENTAL FREQUENCY (kHz)
Fig. 111-A-la-c. Evaluation of response vowels in t e r m s of their two lowest main formant frequencies. Fil led and open c i r c l e s per ta in to s t imul i presented with and without vibrato, r e spec - tively. The thin horizontal dashed lines, show the formant f requencies used in s y n - thesizing the st imuli . Each of the s ix graphs r e f e r s to the r e sponses collected for a given combination of synthesis formant frequencies (cf. Table 111-A-I).
MEAN FORMANT FREQUENCY OF RESPONSES (MEL)
8 8 8 b 8 0 0 0
t I I 1 I I
I I I \ \\, k - I
P
& cn
$
s - I
1 I
I I
1 I I 1 I
1 I I I 1 \\ i I 1 I I I I I i I
k - I I I
- ( I
-* I cn ' I
I I @
t~ 1 4 I \ I
1 \ I ;5 - d 1 b 1
1 I I
I I I - I I I I 1 I I I - 1 I I 1 I I I I I 1 1
1 tl 6 1 I
STL-OPSR 2-3/1975
Table 111-A-11.
STIMULUS FORMANTS
Response u o a e i Y
Number of votes on the response vowels collected for the stimuli indicated. The stimuli a r e given in t e r m s of the formant-freluency combination used in synthesizing them ( see Table 111-A-I), - and t indicate that the stimulus lacked or had vibrato, respectively. Sc indicates that a l l votes s tem f rom one subject only.
STL-QPSR 2 - 3 / 1 9 7 5
Standard deviations for the mean formant frequencies in Mels of the response vowels. The stimuli are given in terms of their formant frequency combinations (FN) and fundamental frequency (I?*).
STL-QPSR 2-3/1975 57.
fully compensate the effect in Fig. III-A- lc. I t should a l so be mentioned,
that a lmost invariably, a l l stimuli with a fundamental of 67 5 Hz gave r e -
sponse vowels in which one of the (assumed) formants matched one of the
stimulus par t ia ls in frequency almost perfectly. This fact may tempt us
to speculate that the formant frequencies ascr ibed to the response vowels
may be cor rec t when the pitch i s very high. Some additional support for
the same speculation may be found in the close frequency agreement be-
tween the formants of the responses and those of the stimulus in the case
of the 1000 Hz fundamental in Fig. 111-A- l c . However, i t s eems definite-
ly unsafe to study the relationships between the stimulus spectrum and the
mean formant frequencies of the responses i n our material . These f re -
quencies should not be compared between stimuli with differing funciamental
frequency.
A comparison of the mean formant frequencies of the response vowels
obtained with and without vibrato would inform u s on the effect that the vibrato
may have on the extraction of sensory character is t ics , a s mentioned. Such
comparisons can be made i n Fig. III-A-2. I t shows the difference in the
mean formant frequencies of the response vowels. Negative values indi-
cate that the mean dropped when the stimulus had a vibrato. According to
a t - tes t the difference i s significant above a 90 O/o level in 13 cases out of
72. This suggests that, under cer tain conditions, the vibrato has an effect
on the extraction of the charactexistics f r o m the sensory information on
the stimulus.
The scat ter between the responses can be computed a s the mean per -
ceptual distance between them. Recently Lindblom (1 97 5) following ideas
of Plomp (1970) has established that the perceptual distance D between any
two Swedish vowels can be approximated a s
where A M i s the frequency difference in d s in the n:th formant between n the two vowels. Using this equation the perceptual distance between the
mean formant frequencies of the response vowels ( see Fig. III-A- 1)
and the individual responses was calculated for each stimulus, and the
average was determined. These mean perceptual distances between the
responses can be studied in Fig. III-A- 3. They can be assumed to r e -
flect the difficulty with which the stimulus is identified a s a specific
FUNDAMENTAL FREQUENCY (kHz)
Fig. 111-A-2. Difference in the first (left), second (middle), and third (right) mean formant frequencies in Mels of the response vowels collected from a given stimulus when it was presented with and without vibrato. The circled dots represent differences that are significant above a 90 O/o level.
- -
STL-QPSR 2- 3/1975 58.
vowel. In the back vowels the identification seems to be simple when the
fundamental is 300 Hz, and a s the fundamental i s ra i sed , the identification
becomes m o r e difficult. In the front vowels the identification is ha rd even
when the fundamental i s low. This i s certainly a consequence of the fact
that the formant frequencies used in synthesizing these stimuli were rath-
e r devient f rom those which a r e typical of normal speech. Rather they
a r e typical of the type of singing which i s often called "covered". In
these stimuli the uncertainty of the identification is not substantially af-
fected by the fundamental frequency. Comparing the values pertaining to
the cases of with and without vibrato revea ls that the identification of vib-
r a to stimuli is harder in approximately a s many cases a s i t i s simpler.
However, the cases of particular interest a r e those which, according to
the t - tes t , changed signfikantly in the mean formant frequency of the r e -
sponse vowels. This occurred in ten cases , and he re the outstanding
features shifted somewhat when the stimulus was presented with a vibrato.
Out of these ten cases the identification became more uncertain in seven
cases. This seems to support the assumption that the vibrato may have
a slight effect on the difficulty with which a sound i s identified a s a vowel,
and the effect i s that the identification i s somewhat harder to per form
when the sound has a vibrato. However, the effect is ve ry weak in our
material .
I t may be argued that the stimuli used in our experiment a r e typical
neither of singing, nor of speech, and hence the subjects' react ions have
l i t t le relevance to the pract ical situation. I t is then interesting to com-
p a r e our resu l t s with data collected f rom similar experiments where the
stimuli were natural, non- synthesized vowels. Such re su l t s have been
presented by Stumpf ( 19 2 6 ) . In h is experiments untrained and profession-
a l s ingers sang high-pitched vowels, and a jury attempted to identify the
intended vowels. Stumpf gave h is r e su l t s in t e r m s of a confusion ma t r ix
and they vYrer e converted into mean per ceptual distance between response
vowels in exactly the same way a s the resu l t s f rom our experiment.
Stumpf' s data a r e a l so given in Fig. 111-A-3. I t is clear f rom the figure
that our stimuli did not give extreme re su l t s a s compared with the natural
stimuli. Thus, vowel identification was not m o r e difficult in our experi- .
ment than i t may be in practice.
STL -0PSR 2- 3/ 197 5 60.
vibrato may change the vowel quality slightly. In such cases , the identifica-
tion of the sound a s a specific vowel seems to be slightly m o r e difficult
when the stimulus h a s a vibrato. I t i s believed that in pract ice, a singer
may profit systematically f rom this effect of the vibrato so a s to reduce
the per ceptability of her deviations f rom the formant frequencies of normal
speech.
Acknowledgments
Stefan Paul i is acknowledged for h i s assis tance in computer program-
ming. The work was supported by The Bank of Sweden Tercentenary
Foundation.
References
CARLSON, R. , FANT, G. , and GRANSTROM, B (1973): "Two-formant models, pitch and vowel perception", paper presented a t the Symp. on Auditory Analysis and Perception of Speech, Leningrad 1973; to be publ. by Academic P r e s s , London 1975.
FANT, G. (1973): Speech Sounds and Fea tures , MIT-Press , Cambridge, Mass . , p. 36 and p. 84.
FLANAGAN, J. L . (1955): "A difference l imen for vowel formant f r e - quency", J.Acoust.Soc.Am. - 27:3, p. 613.
LINDBLOM, B. (197 5): "Experiments in sound structure", paper presented a t the 8th Int. Congr. of Phonetic Sciences, Leeds, Aug.
PLOMP, R. (1970): "Timbre a s a multidimensional attribute of complex tones", in Frequency Analysis and Periodicity Detection in Hearing, ed. by R. Plomp and G . F. Smocrenburg, A. W . Sijthoff, Leiden, p. 397.
SLAWSON, A. W. ( 1968): "Vowel quality and musical t imbre a s functions of spectrum envelope and fundamental frequencyt1, J. Acoust. Soc. Am. 43:1, p. 87. 7
STUMPF, C. (1926): Die Sprachlaute, Springer, Berlin, pp. 77-85. I SUNDBERG, J. ( 1970): "Formant s t ructure and articulation in spoken and
sung vowels", Fol. Phon. 22: 1, p. 28. - SUNDBERG, J. (1972): "Pitch of synthetic sung vowels", STL-QPSR
1/1972, p. 34. ! SUNDBERG, J. (197 5): "Formant technique in a professional female
singer", Acustica 32:3, p. 90. -