8/3/2019 Weighting of the Fin1
http://slidepdf.com/reader/full/weighting-of-the-fin1 1/10
Weighting of the fine structure for the perception of monosyllabic
speech stimuli in the presence of noise
Introduction:
Information in speech is redundant. For normal-hearing subjects, this means that the signal is robust to corruption, and that speech remains intelligible under adverse listening conditions,
such as in high levels of background noise.
In the normal auditory system, a complex sound like speech is filtered into frequency channels on the basilar membrane. The signal at a given place can be considered as a time-varying envelope superimposed on the more rapid fluctuations of a carrier (temporal fine structure, TFS) whose rate depends partly on the center frequency and bandwidth of the channel; this fine structure is important for the perception of speech in noise. The relative envelope magnitude across channels conveys information about the spectral shape of the signal, and changes in the relative envelope magnitude indicate how the short-term spectrum changes over time, which plays a key role in the perception of speech in quiet.
The TFS carries information both about the fundamental frequency (F0) of the sound
(when it is periodic) and about its short-term spectrum.
The bandpass signal at a specific place on the basilar membrane (or the signal produced by bandpass filtering to simulate the waveform at one place on the basilar membrane) can be analyzed using the Hilbert transform to create what is called the "analytic signal" (Bracewell, 1986).
In the mammalian auditory system, phase locking tends to break down for frequencies above 4–5 kHz (Palmer and Russell, 1986), so it is generally assumed that TFS information is not used for frequencies above that limit. The role of TFS in speech perception for frequencies below 5 kHz remains somewhat unclear.
The upper limit of phase locking in humans is not known. Although TFS in the stimulus on the basilar membrane is present up to the highest audible frequencies, this paper is especially concerned with TFS information as represented in the patterns of phase locking in the auditory
nerve. This information probably weakens at high frequencies, and so one way of exploring the
use of TFS information is to examine changes in performance on various tasks as a function of
frequency.
Many studies have assessed the relative importance of TFS and envelope information for speech intelligibility in normal-hearing subjects. The challenge inherent in evaluating the individual contributions of frequency-specific (place) and temporally coded (temporal) cues to auditory perception typically arises from the difficulty of decomposing an auditory signal (such as speech) into a modulator (or envelope) and a carrier so that either can be "independently" altered, reduced, or replaced.
One such method involves decomposition of the signal by means of the Hilbert
transform. This method will be referred to as the Hilbert approach. Although it has several
variants, it can generally be described as follows. A priori, it is assumed that a broadband signal,
S(t), can be described as the sum of N modulated bands, S_n(t), such that

    S(t) = Σ_{n=1}^{N} S_n(t) = Σ_{n=1}^{N} m_n(t) c_n(t),    (1)
where m_n(t) and c_n(t) are, respectively, the modulator and the carrier in the nth band. In order to reduce possible confusion, the original modulator and carrier will always be referred to as m(t) and c(t), respectively. The computed envelope and phase (or temporal fine structure; TFS), defined later on, will always be referred to as a(t) and cos φ(t). From Eq. (1), it is clear that the modulator and the carrier could easily be manipulated separately. However, for an observed signal such as speech, m_n(t) and c_n(t) are unknown, and therefore must be determined. By
introducing Z_n(t), the analytic signal defined by

    Z_n(t) = S_n(t) + j H[S_n(t)],    (2)

where j = √(−1) and H[·] is the Hilbert transform, one can determine the Hilbert instantaneous amplitude, a_n(t), and the Hilbert instantaneous phase, φ_n(t), respectively given by

    a_n(t) = |Z_n(t)|  and  φ_n(t) = arg Z_n(t),

so that the original signal can be rewritten as

    S(t) = Σ_{n=1}^{N} a_n(t) cos φ_n(t).
It is commonly assumed that m_n(t) ≈ a_n(t) and c_n(t) ≈ cos φ_n(t), and thus one can manipulate the envelope and/or the fine structure independently and synthesize a modified version of the original signal.
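As a concrete illustration, the Hilbert decomposition described above can be sketched in Python with `scipy.signal.hilbert`; the band signal below is an arbitrary amplitude-modulated tone, not a stimulus from any cited study:

```python
import numpy as np
from scipy.signal import hilbert

# Illustrative band-limited signal: a 1-kHz carrier with a slow AM envelope.
fs = 16000                        # sampling rate in Hz (arbitrary choice)
t = np.arange(0, 0.05, 1 / fs)
band = (1 + 0.5 * np.sin(2 * np.pi * 40 * t)) * np.cos(2 * np.pi * 1000 * t)

z = hilbert(band)                 # analytic signal Z_n(t)
a = np.abs(z)                     # Hilbert instantaneous amplitude a_n(t)
phi = np.angle(z)                 # Hilbert instantaneous phase phi_n(t)

# a_n(t) * cos(phi_n(t)) recovers the band signal to numerical precision.
recon = a * np.cos(phi)
```

Manipulating `a` or `cos(phi)` separately before recombining is exactly the kind of operation the Hilbert approach permits.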
Several recent studies, however, suggest that the Hilbert approach may be inappropriate for decomposing complex signals such as speech. It should be noted that this restriction is limited to those situations where the envelope and/or the fine structure are manipulated (e.g., filtered) prior to being added back together to synthesize a new signal.
Ghitza (2001) first suggested that part of the original envelope information can be recovered
from the Hilbert fine structure at the output of the auditory filters. The intelligibility of TFS-speech may thus be influenced by reconstructed E cues. The reconstructed envelope cues contribute to the intelligibility of TFS-speech,
even though the envelope cues alone are not sufficient to give good intelligibility. The fact that
learning is required to achieve high intelligibility with TFS-speech may indicate that the auditory
system normally uses TFS cues in conjunction with envelope cues; when envelope cues are
minimal, TFS information may be difficult to interpret. Alternatively, the learning may reflect the fact that TFS cues are distorted in TFS-speech (relative to unprocessed speech), and it
may require some training to overcome the effects of the distortion.
Several behavioral (Zeng et al., 2004; Gilbert and Lorenzi, 2006) and neurophysiological (Heinz and Swaminathan, 2009) studies have since confirmed that envelopes derived from the TFS can produce good speech intelligibility. In the behavioral studies, normal-hearing (NH) listeners were presented with the TFS of speech stimuli or with a series of noise or tone carriers amplitude-modulated by the recovered envelopes.
In the latter case, a technique similar to vocoder processing (Shannon et al., 1995) was used, and the recovered envelopes corresponded to the outputs of a bank of gammachirp auditory filters (Irino and Patterson, 1997) in response to the original speech fine structure.
"Vocoder" processing has been used to remove TFS information from speech, thus allowing speech intelligibility based on envelope and spectral cues to be measured (Dudley, 1939; Van Tasell et al., 1987; Shannon et al., 1995).
A speech signal is filtered into a number of channels (N), and the envelope of each channel signal is used to modulate a carrier signal, typically a noise (for a noise vocoder) or a sine wave with a frequency equal to the channel center frequency (for a tone vocoder). The modulated signal for each channel is filtered to restrict the bandwidth to the original channel bandwidth, and the modulated signals from all channels are then combined. For a single talker, provided that N is sufficiently large, the resulting signal is highly intelligible to both normal-hearing and hearing-impaired subjects (Shannon et al., 1995; Turner et al., 1995; Baskent, 2006;
Lorenzi et al., 2006b). However, if the original signal includes both a target talker and a
background sound, intelligibility is greatly reduced, even for normal-hearing subjects (Dorman et
al., 1998; Fu et al., 1998; Qin and Oxenham, 2003; Stone and Moore, 2003), leading to the
suggestion that TFS information may be important for separation of a talker and background into
separate auditory streams (Friesen et al., 2001).
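The tone-vocoder steps just described (bandpass analysis, envelope extraction, modulation of a sine carrier at the channel centre frequency, band-limiting, and summation) can be sketched as follows; the band edges, filter order, and Hilbert-based envelope extraction are illustrative assumptions rather than the parameters of any cited study:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def tone_vocode(x, fs, edges):
    """Minimal tone-vocoder sketch: filter into bands, extract each band's
    Hilbert envelope, and re-impose it on a sine at the band centre frequency."""
    out = np.zeros_like(x)
    t = np.arange(len(x)) / fs
    for lo, hi in zip(edges[:-1], edges[1:]):
        sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
        band = sosfiltfilt(sos, x)
        env = np.abs(hilbert(band))              # envelope of this channel
        fc = np.sqrt(lo * hi)                    # geometric centre frequency
        carrier = np.cos(2 * np.pi * fc * t)     # tone carrier
        out += sosfiltfilt(sos, env * carrier)   # restrict to channel bandwidth
    return out

fs = 16000
t = np.arange(0, 0.2, 1 / fs)
x = np.sin(2 * np.pi * 500 * t) * (1 + 0.8 * np.sin(2 * np.pi * 8 * t))
edges = [100, 300, 700, 1500, 3100, 6300]        # illustrative band edges (Hz)
y = tone_vocode(x, fs, edges)
```

A noise vocoder would substitute a band of noise for the tone carrier in each channel.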
Zeng et al. (2004) found up to 40% correct performance for sentences, and Gilbert and Lorenzi (2006) found up to 60% correct performance for consonants.
Gilbert and Lorenzi (2006) also showed that performance decreases with increasing number of
analysis bands. The authors attributed the effect of the number of bands to the ratio between the bandwidth of the analysis filters and that of the auditory filters. They also concluded that
consonant identification is essentially abolished when the bandwidth of the analysis filters is less
than or equal to four times the bandwidth of normal auditory filters.
India is multicultural and multilingual, and most of its speakers are bilingual; Indian English differs from the English of other countries because of this multilingualism. There is therefore a dearth of knowledge about the extent to which the low frequencies contribute to speech intelligibility in Indian English, and how much low-frequency hearing preservation is needed for the perception of speech in noise has not been explored.
If this information were known, it could serve as an outcome measure prior to implantation. Hence, the present study addresses the weighting of the fine structure for the perception of monosyllabic speech stimuli in noise, i.e., determining whether high-frequency or low-frequency fine structure information is required for speech perception in noise.
Methodology:
Subjects :
Age range: 18 to 25 years (young adults)
Control group: normal-hearing individuals (normal hearing as per ANSI criteria).
All should have normal hearing, defined as audiometric thresholds of 20 dB HL (hearing level) or better at octave frequencies between 250 and 8000 Hz, together with normal immittance measures and histories consistent with normal hearing.
Experimental group: individuals with moderately severe sensorineural hearing loss (postlingual deafness).
The hearing-impaired subjects were selected to have "flat" moderate hearing losses, and they were divided into two groups: young (n = 7; mean age = 24; range: 18–25) and elderly (n = 7; mean age = 68; range: 63–72), because there is some evidence that the ability to use TFS decreases with increasing age.
Air-conduction, bone-conduction, and immittance audiometry for the hearing-impaired subjects were consistent with sensorineural impairment. The origin of the hearing loss was unknown for all elderly subjects and was either congenital or hereditary for the young ones. All impaired subjects had been fitted with a hearing aid on the tested ear for at least 9 years.
Number of subjects: as many as possible within the time constraints of the data collection.
All subjects were fully informed about the goal of the present study and provided
written consent before their participation.
Stimuli to be used:
To overcome bias arising from differences in the subjects' semantic knowledge, and to check the efficiency of the technology, the speech material in the study consisted of 50 ISHA PB words.
All the PB words will be produced by a female and a male speaker and recorded using an SLM and Adobe Audition at a 44,100-Hz sampling rate, in a sound-proof booth, onto a laptop.
Instruments to be used:
MATLAB 2010a for signal processing.
GSI 61 (dual channel) for presenting stimuli.
Stimuli Synthesis
Phase I :
Stimuli: Speech signals will be digitized (16-bit resolution) at a 44.1-kHz sampling frequency; they will then be band-pass filtered using Butterworth filters (72 dB/oct rolloff) into critical bands based on the Greenwood frequency-position function, spanning the range 80–8,020 Hz. The bands will be less than two times as wide as the "normal" auditory filters (44), and probably comparable to the widths of the auditory filters of the impaired subjects, thus ensuring that recovered E cues would be minimal for both groups of subjects.
The use of these analysis bands also ensured that the amount of spectral
information provided by the E stimuli was similar for the normal-hearing and
hearing-impaired subjects (ref: Gilbert G, Lorenzi C (2006) J Acoust Soc Am 119:2438–2444).
These bandpass filtered signals were then processed in three ways.
In the first (referred to as "intact"), the signals were summed over all frequency bands. These signals contained both TFS and E information.
In the second (referred to as "E"), the envelope was extracted in each frequency band using the Hilbert transform followed by lowpass filtering with a Butterworth filter (cutoff frequency = 64 Hz, 72 dB/oct rolloff).
The filtered envelope was used to amplitude modulate a sine wave with a
frequency equal to the centre frequency of the band, and with random
starting phase.
The 16 amplitude-modulated sine waves were summed over all frequency bands. These stimuli contained only E information.
In the third (referred to as "TFS"), the Hilbert transform was used to decompose the signal in each frequency band into its E and TFS components.
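For a single analysis band, the E processing above can be sketched as follows; a 72 dB/oct Butterworth rolloff corresponds to a 12th-order filter (6 dB/oct per order), and the band signal here is an illustrative stand-in for one filtered speech channel:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

fs = 44100                                   # sampling rate used in the study
t = np.arange(0, 0.1, 1 / fs)
fc = 1000.0                                  # illustrative channel centre frequency
band = (1 + 0.5 * np.sin(2 * np.pi * 10 * t)) * np.cos(2 * np.pi * fc * t)

z = hilbert(band)
env = np.abs(z)                              # Hilbert envelope (E)
tfs = np.cos(np.angle(z))                    # fine structure (TFS), unit amplitude

# 64-Hz lowpass, 12th-order Butterworth (~72 dB/oct), applied zero-phase.
sos = butter(12, 64, btype="lowpass", fs=fs, output="sos")
env_lp = sosfiltfilt(sos, env)

# Re-impose the smoothed envelope on a tone at the channel centre frequency,
# with a random starting phase, as in the E condition.
rng = np.random.default_rng(0)
phase = rng.uniform(0, 2 * np.pi)
e_channel = env_lp * np.cos(2 * np.pi * fc * t + phase)
```

Summing such channel signals over all bands yields the E stimuli; keeping `tfs` with a flat amplitude instead yields the TFS stimuli.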
Procedure
All stimuli were delivered monaurally to the right ear via TDH 39 headphones.
The stimuli were presented to the normal-hearing subjects at a level of SRS, and to the hearing-impaired subjects at a level ensuring that the stimuli were audible and comfortably loud.
Condition I:
Each individual will be presented with PB words both with and without noise.
Condition II:
Each individual will be presented with a processed signal in which the TFS information of at least one band of frequencies has been removed, each time in ascending (low to high) order of the eliminated band.
Response:
Oral repetition of the word would be expected.
Scoring:
A score of '0' for every wrong repetition and '1' for every correct repetition would be allotted.
ROL:
Band width:
The function described in 1961 (Greenwood, 1961b), fitted to the critical-band data of Fletcher (1940, 1953) and Zwicker et al. (1957), hypothesized that critical bandwidth, in Hz, might follow an exponential function of distance, x (in any physical units or normalized distance), along the cochlear partition:

    CB = c·10^(ax) + h,

so that critical bands correspond to a constant distance on the basilar membrane. The frequency-position function obtained by integrating this critical-band function (see Fig. 1) is:
    F = A(10^(ax) − k),

where F is in Hz and x is in mm, and where suitable constants (for man) are A = 165 and a = 0.06, the latter an empirical constant arising in the critical-band function but found also to agree closely with the logarithmic slope of Békésy's volume-compliance gradient for the human cochlear partition; and k, an integration constant left here at the original value 1, but that may sometimes be better replaced by a number from about 0.8 to 0.9 to set a lower frequency limit dictated by convention or by the best fit to data.
Although the value k = 0.88 would yield the conventional lower frequency limit of 20 Hz for man, I will continue to use 1.0 for man and most of the other species in this paper, except for the cat, since Liberman (1982) found that a k of 0.8 best adjusts this function to his low-frequency data points in the cat.
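As a sketch, the frequency-position function with the human constants quoted above (A = 165, a = 0.06 with x in mm, k = 1) can be used to place analysis-band edges at equal cochlear distances; the specific positions and number of bands below are illustrative choices, not the study's exact parameters:

```python
import numpy as np

def greenwood(x_mm, A=165.0, a=0.06, k=1.0):
    """Greenwood frequency-position function: F = A * (10**(a*x) - k), F in Hz."""
    return A * (10.0 ** (a * x_mm) - k)

# Equal steps in cochlear distance give exponentially spaced band edges.
x = np.linspace(2.86, 28.26, 17)     # illustrative positions (mm) for 16 bands
edges = greenwood(x)                 # spans roughly 80 Hz to 8 kHz
```

Because equal distances along the partition map to exponentially spaced frequencies, the resulting bands approximate critical-band spacing.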
No. of Bands:
Traditionally, the spectral magnitudes have been regarded as of primary importance for perception, although under some conditions the phases of the components play an important role (Moore, 2002).
As noted above, the bandpass signal at a given place on the basilar membrane can be analyzed using the Hilbert transform to create the "analytic signal" (Bracewell, 1986), which decomposes the time signal into its envelope (E; the relatively slow variations in amplitude over time) and temporal fine structure (TFS; the rapid oscillations with a rate close to the center frequency of the band).
Each filter was chosen to have a bandwidth of 1 ERB_N, where ERB_N stands for the equivalent rectangular bandwidth of the auditory filter as determined using young, normally hearing listeners at moderate sound levels (Glasberg and Moore, 1990; Moore, 2003). The suffix N denotes normal hearing.
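For reference, the Glasberg and Moore (1990) formula for ERB_N is ERB_N = 24.7 (4.37 F / 1000 + 1), with F the centre frequency in Hz; a minimal sketch:

```python
def erb_n(f_hz):
    """Equivalent rectangular bandwidth of the normal auditory filter
    (Glasberg and Moore, 1990), with centre frequency f_hz in Hz."""
    return 24.7 * (4.37 * f_hz / 1000.0 + 1.0)

# At a 1-kHz centre frequency the normal auditory filter is ~133 Hz wide.
bw_1k = erb_n(1000.0)
```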
Traditionally, the envelope has been regarded as the most important carrier of information, at least for speech signals. Both E and TFS information are represented in the timing of neural discharges, although TFS information depends on phase locking to individual cycles of the stimulus waveform (Young and Sachs, 1979).
In most mammals, phase locking weakens for frequencies above 4–5 kHz, although some
useful phase locking information may persist for frequencies up to at least 10 kHz (Heinz et al.
2001).