Realtime Pitch



Real Time Speech Classification and Pitch Detection
JONATHAN A. MARKS, MEMBER, IEEE

INTERNATIONAL DIGITAL CORPORATION, P.O. Box 6270, Johannesburg 2000

Abstract: An accurate silence-unvoiced-voiced classification and pitch detection algorithm is described and its implementation for real time applications on a Texas Instruments TMS320C25 digital signal processor is evaluated. Speech classification is separated into silence detection and voiced-unvoiced classification. Only the signal's energy level and zero crossing rate are used in both classification processes. Pitch detection need only operate on voiced periods of speech. An elaborate peak picking technique is used to successively home in on the peaks that bound the pitch periods. Tests are performed on the found peaks to ensure that they are pitch period peaks. A real time implementation strategy is developed that combines silence detection with the signal acquisition and tightly couples voiced-unvoiced classification with pitch detection. The silence detection task is interrupt driven and the pitch detection task loops continuously. The execution speed and accuracy scores for this algorithm are shown to compare favourably with those for other such algorithms published in Rabiner et al. [3].

I. INTRODUCTION

SPEECH CLASSIFICATION (silence, unvoiced and voiced) and pitch detection are fundamental to many digital speech processing applications, such as speaker and speech recognition, and bit rate reduction for storage and transmission. Many such algorithms exist [1]-[16]. Most are too computationally intensive for practical real time applications. The faster executing algorithms usually do not exhibit the accuracy required by many real time speech processing applications. Jayant [17] considers real time implementations feasible if they can be accomplished with a single microprocessor.

This paper describes a speech classification and segmentation algorithm implemented for real time processing on a Texas Instruments TMS320C25 Digital Signal Processor and evaluates its accuracy and speed performance. This section examines certain speech signal characteristics and concludes with a discussion of the problems associated with speech classification and pitch detection. Section II reviews some speech classification and pitch detection techniques and selects those used in this algorithm. In Section III, the real time implementation strategy for the algorithm is developed. Section IV describes the selection of speakers, the spoken messages and how they were recorded. Section V evaluates the accuracy of the algorithm by examining its silence detection, voiced-unvoiced classification, and pitch detection performance. These results are compared to those obtained for other such algorithms found in Rabiner et al. [3]. In Section VI the execution speed performance of the algorithm is evaluated. Section VII concludes with a discussion of the major results obtained from the evaluation of the algorithm.

A. Speech Signal Characteristics

Speech is the most natural form of communication known to man, yet it is one of the most complex signals to analyse or model. Segments of speech can be described (or classified) in terms of the sounds they produce. Broadly, there are four categories:
• Silence.
• Unvoiced utterances.
• Voiced utterances.
• Plosives.

Silence is that part of the signal where no speech is present. Unvoiced sounds, such as the /S/ in salt, are created by air passing through the vocal tract without the vocal cords vibrating. Voiced sounds, such as the /AH/ in and, are created by air passing through the glottis causing it to vibrate. Plosives, such as the /b/ in bee, are created by the sudden release of air from a constriction created in the vocal tract.

Unvoiced speech exhibits low signal energy, no pitch, and a frequency spectrum biased towards the higher frequencies of the audio band, normally peaking at four to five kilohertz. Voiced speech, on the other hand, has greater signal energy, pitch, and a spectrum biased towards the lower frequencies. Silence (background noise) is assumed to have a flat frequency response and low energy levels. Plosives are transient in nature with relatively high energy levels and low frequency content.
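As an illustration of how the two time domain measurements referred to throughout the paper can be obtained, the following C sketch computes the short-time energy and the zero crossing count of a single frame of samples. It is not the paper's TMS320C25 implementation; the frame length, the implied 8 kHz sampling rate and the 16-bit sample format are assumptions made only for the example.

    /*
     * Short-time energy and zero crossing count for one frame of speech.
     * FRAME_LEN, the 16-bit sample format and the implied 8 kHz sampling
     * rate are assumptions for this sketch, not values from the paper.
     */
    #define FRAME_LEN 160          /* e.g. 20 ms of speech at 8 kHz */

    /* Sum of squared samples over the frame: high for voiced speech,
       lower for unvoiced speech, lowest for silence. */
    long frame_energy(const short *frame, int n)
    {
        long e = 0;
        int i;
        for (i = 0; i < n; i++)
            e += (long)frame[i] * frame[i];
        return e;
    }

    /* Number of sign changes over the frame: high for unvoiced speech,
       lower for voiced speech and silence. */
    int frame_zero_crossings(const short *frame, int n)
    {
        int zc = 0;
        int i;
        for (i = 1; i < n; i++)
            if ((frame[i - 1] >= 0) != (frame[i] >= 0))
                zc++;
        return zc;
    }

Both measurements are made in the time domain, which is what keeps the per-frame cost low enough for a single-processor real time implementation.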

B. Problems in Speech Classification and Pitch Detection

It is generally recognised that accurate silence-unvoiced-voiced classification and pitch detection is extremely difficult for the following reasons:
• Unvoiced and voiced speech usually overlap and flow into each other, making it nearly impossible to pin-point the exact times where these transitions take place.
• Speech occasionally produces waveforms that cannot be classified as either voiced or unvoiced, even by visual inspection of amplitude-time curves.
• Low level voiced speech can easily be confused with unvoiced speech, especially when the formant frequencies are high and prominent.
• The glottal excitation waveform of voiced speech is not a perfect train of periodic pulses. It varies in frequency, amplitude and shape.
• The shape of the glottal excitation waveform is changed by the interaction of the vocal tract. The vocal tract creates formant frequencies and other harmonics. If these formants and harmonics contain energy comparable to the pitch frequency energy, they interfere with the pitch detection criteria and can be confused with the pitch frequency.
• The vocal tract is continuously changing shape, and often builds up constrictions which block the flow of air through it. When the air is released, it rushes out of the vocal tract, creating a plosive. Plosives have similar amplitude characteristics to voiced speech, but, like unvoiced speech, have no pitch. Therefore it is possible for plosives to be confused as either voiced or unvoiced speech.

II. SPEECH CLASSIFICATION AND PITCH DETECTION TECHNIQUES

In this section, various techniques used to classify speech and locate the pitch period in voiced speech are discussed. The review of techniques is not exhaustive and is biased towards those methods that lend themselves to real time implementation.



A. Speech Classification Techniques

Speech signal classification techniques can be divided into two categories:
• Those that apply directly to the pitch detection problem, i.e. speech is usually classified unvoiced when the pitch cannot be found [7], [9], [11].
• Those that extract parameters from the speech signal and attempt to recognise patterns. That is, they are based on differences in statistical distributions and/or characteristic features of silence, unvoiced and voiced utterances [18], [19], [20], [21], [22].

To continuously classify speech accurately, the algorithm must adapt to the incoming speech signal. It is difficult to make the methods in the former category adaptive to classification, as their primary concern is pitch detection. It is generally easier to make the techniques in the latter category adapt to the incoming signal.

The most commonly used speech classification parameters are:
• Relative energy level.
• Zero crossing rate.
• First autocorrelation coefficient.
• First linear predictive coding (LPC) predictor coefficient.
• Normalised prediction error.

The relative energy level of voiced speech is usually higher than that of unvoiced speech and silence. The zero crossing rate is a measure of frequency content in the signal. Unvoiced speech exhibits a higher zero crossing rate than voiced speech or silence. Voiced speech does not change as rapidly as unvoiced speech; therefore voiced speech often has a higher first autocorrelation coefficient and a smaller prediction error than unvoiced speech. The first LPC coefficient, which is the cepstrum value of the signal at one unit delay, is smaller for voiced speech than for unvoiced speech.

Atal and Rabiner [22] apply pattern recognition techniques using probability theory to the five speech parameters. Mwangi and Xydeas [18] apply Fuzzy Set theory to these five speech parameters, and use a pattern recognition approach to classify speech signals. These techniques do not lend themselves to real time processing because of their high computational intensity.

The adopted classification approach only uses the relative energy level and zero crossing rate of the speech signal. These parameters are based on time domain measurements and require the least computational overhead to extract from the waveform. By using these two parameters and making the algorithm adapt to the conversation level of the signal, it is possible to achieve the desired accuracy.
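The energy and zero crossing measurements were sketched in Section I. For completeness, the sketch below shows one way the autocorrelation-based parameters, which the paper reviews but does not adopt, could be computed. The function names are placeholders, and a first-order predictor is used purely to keep the illustration short: for that order the predictor coefficient equals the normalised first autocorrelation coefficient r1 and the normalised prediction error is 1 - r1*r1, whereas the cited techniques use higher-order LPC analyses.

    /*
     * Normalised first autocorrelation coefficient of one frame,
     *     r1 = sum x[i]*x[i-1] / sum x[i]*x[i],
     * and the normalised prediction error of a first-order linear
     * predictor, 1 - r1*r1.  Illustrative only; the techniques cited in
     * the text use higher-order LPC analyses.
     */
    double first_autocorr_coeff(const double *x, int n)
    {
        double num = 0.0;
        double den = 1e-12;            /* guards against an all-zero frame */
        int i;
        for (i = 0; i < n; i++)
            den += x[i] * x[i];
        for (i = 1; i < n; i++)
            num += x[i] * x[i - 1];
        return num / den;              /* closer to 1 for voiced speech */
    }

    double normalised_prediction_error(const double *x, int n)
    {
        double r1 = first_autocorr_coeff(x, n);
        return 1.0 - r1 * r1;          /* smaller for voiced than unvoiced speech */
    }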

In the implemented algorithm, speech classification is divided into two separate tasks:
• Silence detection.
• Voiced-unvoiced classification.

To detect silence, the relative energy level is a useful parameter for determining whether a speech signal is present or not. Davis [23] and Sims [21] have both implemented silence detection algorithms using only the signal energy level. Davis recommends the use of the zero crossing rate to detect silence periods accurately. Rabiner and Schafer [1] propose a silence detection algorithm using both the relative energy level and zero crossing rate. The zero crossing rate is required to discern between low amplitude fricatives and silence periods; both can have similar energy levels, but fricatives have a higher frequency content.
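A minimal sketch of a frame-level silence decision of the kind described above, using only the energy and zero crossing measurements from Section I. The two threshold values are placeholder assumptions; the paper's algorithm adapts its decision levels to the conversation level of the signal rather than fixing them in advance.

    /*
     * Frame-level silence decision using only energy and zero crossing
     * count.  Both thresholds are placeholders for this sketch; the paper
     * adapts its thresholds to the conversation level of the signal.
     */
    #define SILENCE_ENERGY_THRESHOLD 1000L   /* assumed, signal dependent     */
    #define FRICATIVE_ZC_THRESHOLD   40      /* assumed, per 160-sample frame */

    int frame_is_silence(long energy, int zero_crossings)
    {
        if (energy >= SILENCE_ENERGY_THRESHOLD)
            return 0;    /* sufficient energy: speech is present              */
        if (zero_crossings >= FRICATIVE_ZC_THRESHOLD)
            return 0;    /* low energy but high zero crossing rate:           */
                         /* probably a low amplitude fricative, not silence   */
        return 1;        /* low energy and low zero crossing rate: silence    */
    }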

In unvoiced-voiced speech classification, it sometimes happens that voiced speech is classified unvoiced and vice versa. This is due to the overlap in the relative signal level and zero crossing rate values used to classify the speech signal. By using the principle of hysteresis and delaying confirmation of the classification until more of the incoming signal has been considered, this problem is largely overcome. Fig. 1 provides a graphical ex
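One way to realise the delayed-confirmation idea described above is to keep the previous confirmed voiced-unvoiced decision until a new provisional decision has persisted for several consecutive frames. The sketch below is only an assumed illustration of that principle, not the paper's scheme (the figure it refers to is not available here); the confirmation count, type names and state layout are all placeholders.

    /*
     * Delayed confirmation of the voiced-unvoiced decision.  A provisional
     * per-frame label only becomes the confirmed label after it has been
     * observed on CONFIRM_FRAMES consecutive frames; until then the
     * previous confirmed label is kept.  CONFIRM_FRAMES is an assumed value.
     */
    #define CONFIRM_FRAMES 3

    typedef enum { UNVOICED = 0, VOICED = 1 } VClass;

    typedef struct {
        VClass confirmed;   /* label currently reported to the rest of the system */
        VClass candidate;   /* provisional label awaiting confirmation            */
        int    run_length;  /* consecutive frames supporting the candidate        */
    } VuvState;

    VClass vuv_update(VuvState *s, VClass raw_frame_label)
    {
        if (raw_frame_label == s->confirmed) {
            s->run_length = 0;               /* no change requested            */
        } else if (raw_frame_label == s->candidate) {
            if (++s->run_length >= CONFIRM_FRAMES) {
                s->confirmed = s->candidate; /* enough evidence: switch state  */
                s->run_length = 0;
            }
        } else {
            s->candidate = raw_frame_label;  /* start counting a new candidate */
            s->run_length = 1;
        }
        return s->confirmed;
    }

Holding the previous decision until several frames agree trades a small classification delay for far fewer spurious voiced-unvoiced transitions, which is the effect the delayed-confirmation approach aims for.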