CepStrum Noll 67


Received 24 August 1966 9.3, 9.8, 9.9

Cepstrum Pitch Determination

A. MICHAEL NOLL

Bell Telephone Laboratories, Murray Hill, New Jersey 07971

The cepstrum, defined as the power spectrum of the logarithm of the power spectrum, has a strong peak corresponding to the pitch period of the voiced-speech segment being analyzed. Cepstra were calculated on a digital computer and were automatically plotted on microfilm. Algorithms were developed heuristically for picking those peaks corresponding to voiced-speech segments and the vocal pitch periods. This information was then used to derive the excitation for a computer-simulated channel vocoder. The pitch quality of the vocoded speech was judged by experienced listeners in informal comparison tests to be indistinguishable from the original speech.

INTRODUCTION

Voiced-speech sounds result from the resonant action of the vocal tract on the periodic puffs of air admitted through the vocal cords. For pitch-period determination, the time periodicity of the source signal must be obtained from the observed speech signal. Also, voiced-unvoiced decisions require accurate determination of the presence or absence of such periodic puffs in the source signal. This deceptively simple problem has been the object of considerable research over the past few decades. Aside from its obvious use in analysis of speech sounds from a pure research standpoint, an accurate pitch detector must also perform adequately as an integral part of most speech-bandwidth compression schemes. The design of an accurate pitch detector that works satisfactorily with band-limited, noisy speech signals remains one of the challenging areas of speech processing research.

In a previous paper, a new method for obtaining the fundamental frequency or pitch of human speech was described.¹ Since the logarithm of the amplitude spectrum of a periodic time signal with rich harmonic structure is itself "periodic" in frequency, the new method consisted of spectrum analyzing this log amplitude spectrum. Adopting some new terminology proposed by Tukey, the method was called "cepstrum" pitch detection, where the term cepstrum refers to the spectrum of the log-amplitude spectrum. Computer programs were written to perform short-time cepstrum analyses of speech, and the resultant pitch information was used to obtain the excitation for computer-simulated vocoders. The synthesized speech was quite encouraging as demonstrated by tapes played at the sixty-seventh meeting of the Acoustical Society.²

The early computer programs written to simulate the cepstrum analyzer have since undergone a number of changes towards simplicity and efficiency. The results of analyses of speech cepstra were used to design an automatic method for determining the pitch periods from the cepstral peaks. This automatic peak picker, though not previously described, was used to obtain the excitation signals for the computer-simulated vocoders. Some interesting and unexpected pitch fluctuations and pitch doubling have been discovered during the observations of speech cepstra required to develop the algorithms for the cepstral peak picker. These topics and new approaches to explaining and justifying cepstrum pitch determination were not reported in the previous papers; and now is also a good time to present the historical background leading to the concept of short-time cepstrum analysis for vocal-pitch detection. This paper treats all these topics and concludes with descriptions of some possible hardware implementations of cepstrum analyzers.

¹ A. M. Noll, "Short-Time Spectrum and 'Cepstrum' Techniques for Vocal-Pitch Detection," J. Acoust. Soc. Am. 36, 296-302 (1964).

I. HISTORICAL BACKGROUND

In the fall of 1959, Bogert (of Bell Telephone Laboratories) noticed banding in spectrograms of seismic signals. He realized that this banding was caused by "periodic" ripples in the spectra and that this was characteristic of the spectra of any signal consisting of itself plus an echo. The frequency spacing of these ripples equals the reciprocal of the difference in time arrivals of the two waves. Tukey (of both Princeton University and Bell Telephone Laboratories) suggested that this frequency difference might be obtained by first taking the logarithm of the spectrum, thereby making the ripples nearly cosinusoidal. A spectrum analysis of the log spectrum then could be performed to determine the "frequency" of the ripple.

In early 1960, Bogert programmed Tukey's suggestion on a computer and proceeded to analyze numerous earthquakes and explosions. Tukey, noticing similarities between time series analysis and log-spectrum series analysis, introduced a new set of paraphrased terms. The spectrum of the log spectrum was called the "cepstrum," and the frequency of the spectral ripples was referred to as "quefrency." Bogert, Tukey, and Healy published their ideas in an article with perhaps one of the weirdest titles ever encountered in the scientific literature: "The Quefrency Alanysis of Time Series for Echoes: Cepstrum, Pseudo-Autocovariance, Cross-Cepstrum and Saphe Cracking." In the article, they very clearly expressed a pessimistic view for achieving adequate classification of seismic events by cepstral techniques. In fact, no definitive indication of focal depth was found.

² A. M. Noll and M. R. Schroeder, "Short-Time 'Cepstrum' Pitch Detection," J. Acoust. Soc. Am. 36, 1030 (1964).

The Journal of the Acoustical Society of America 293

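[Figure residue: block-diagram labels "imaginary part of spectrum," "adder," and a sine function generator; the caption is not recoverable from the source.]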



II. CEPSTRUM-PITCH DETERMINATION

In its most basic form, the system for producing voiced speech sounds consists only of the vocal source and the vocal tract as shown in Fig. 2. The source signal s(t) is the periodic puffs of air admitted through the vocal cords. The effect of the vocal tract is completely specified by its impulse response h(t) such that the output speech signal f(t) equals the convolution of s(t) and h(t). Alternatively, if S(ω) is the spectrum of the vocal source and H(ω) is the transfer function or spectrum of the vocal tract, then the spectrum of the speech signal equals the product of S(ω) and H(ω). Expressed algebraically,

\begin{align}
f(t) &= s(t) * h(t), \tag{1}\\
F(\omega) &= S(\omega)\,H(\omega), \tag{2}
\end{align}

with

\begin{align}
F(\omega) &= \mathcal{F}[f(t)], \tag{3}\\
S(\omega) &= \mathcal{F}[s(t)], \tag{4}\\
H(\omega) &= \mathcal{F}[h(t)], \tag{5}
\end{align}

where * denotes convolution, ℱ denotes Fourier transformation, and the Fourier transforms of s(t) and h(t) are assumed to exist.

The source signal and, therefore, the speech signal, are quasiperiodic for voiced-speech sounds. If the period is T seconds, then the power spectrum |F(ω)|² of the speech signal consists of harmonics spaced T⁻¹ Hz. Thus, the power spectrum of a voiced speech signal is "periodic" along the frequency axis with "period" equal to the reciprocal of the period of the time signal being analyzed. The obvious way to measure this "period" in the power spectrum is to take the Fourier transform of the spectrum, which will have a peak corresponding to the "period." This spectrum of the power spectrum is more commonly known as the autocorrelation function of the original time signal. Mathematically, the autocorrelation function r(τ) is defined as

$$r(\tau) = \mathcal{F}\big[\,|F(\omega)|^2\,\big]. \tag{6}$$

The speech power spectrum equals the product of the power spectra of the vocal source and the vocal tract. But the Fourier transform of a product equals the convolution of the Fourier transforms of the two multiplicands. Thus,

\begin{align}
r(\tau) &= \mathcal{F}\big[\,|S(\omega)|^2\,|H(\omega)|^2\,\big] \tag{7}\\
&= \mathcal{F}\big[\,|S(\omega)|^2\,\big] * \mathcal{F}\big[\,|H(\omega)|^2\,\big] \tag{8}\\
&= r_s(\tau) * r_h(\tau), \tag{9}
\end{align}

where r_s(τ) and r_h(τ) are the autocorrelation functions of s(t) and h(t), respectively. The effects of the vocal source and vocal tract are therefore convolved with each other in the autocorrelation functions. This results in broad peaks and in some cases multiple peaks in the autocorrelation function; thus, an autocorrelation approach to pitch determination is, in general, unsatisfactory.

FIG. 2. Basic system for the production of voiced speech sounds: the vocal source s(t) drives the vocal tract, producing the voiced speech f(t) = s(t) * h(t). h(t) is the impulse response of the vocal tract.

The solution is to devise a new function in which the effects of the vocal source and vocal tract are nearly independent or easily identifiable and separable. The Fourier transform of the logarithm of the power spectrum is such a new function and, indeed, separates the effects of the vocal source and tract. The reason for this is that the logarithm of a product equals the sum of the logarithms of the multiplicands:

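\begin{align}
\log |F(\omega)|^2 &= \log\big[\,|S(\omega)|^2 \cdot |H(\omega)|^2\,\big] \tag{10}\\
&= \log |S(\omega)|^2 + \log |H(\omega)|^2. \tag{11}
\end{align}

The Fourier transform of the logarithm power spectrum preserves the additive property and is

$$\mathcal{F}\big[\log |F(\omega)|^2\big] = \mathcal{F}\big[\log |S(\omega)|^2\big] + \mathcal{F}\big[\log |H(\omega)|^2\big]. \tag{12}$$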
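The additive property can be checked numerically. The following is a minimal sketch, not part of the original analysis: it assumes NumPy, uses random sequences as stand-ins for s(t) and h(t), and zero-pads the discrete transforms so that they represent the linear convolution of Eq. 1. All names are illustrative.

```python
import numpy as np

def log_power_spectrum(x, n_fft):
    """Return log|X(w)|^2 of x, zero-padded to n_fft points."""
    spectrum = np.fft.fft(x, n_fft)
    return np.log(np.abs(spectrum) ** 2)

rng = np.random.default_rng(0)
s = rng.standard_normal(64)   # stand-in "source" signal (hypothetical data)
h = rng.standard_normal(16)   # stand-in "tract" impulse response (hypothetical data)
f = np.convolve(s, h)         # f(t) = s(t) * h(t), Eq. 1; 79 samples
n_fft = 128                   # at least len(f), so no wrap-around occurs

log_f = log_power_spectrum(f, n_fft)
log_s = log_power_spectrum(s, n_fft)
log_h = log_power_spectrum(h, n_fft)

# Eqs. 10-11: the logarithm turns the product of power spectra into a sum.
additive_in_log = np.allclose(log_f, log_s + log_h)

# Eq. 12: a second Fourier transform is linear, so the sum is preserved.
additive_after_transform = np.allclose(
    np.fft.fft(log_f), np.fft.fft(log_s) + np.fft.fft(log_h)
)
```

Both flags evaluate to true for this data, whereas the same test applied to the power spectra themselves (without the logarithm) would require a convolution, as in Eqs. 7-9.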

The source and tract effects are now additive rather than convolved as in the autocorrelation. The importance of this can be intuitively explained with the assistance of Fig. 3. The effect of the vocal tract is to produce a "low-frequency" ripple in the logarithm spectrum, while the periodicity of the vocal source manifests itself as a "high-frequency" ripple in the logarithm spectrum. Therefore, the spectrum of the logarithm power spectrum has a sharp peak corresponding to the high-frequency source ripples in the logarithm spectrum and a broader peak corresponding to the low-frequency formant structure in the logarithm spectrum. The peak corresponding to the source periodicity can be made more pronounced by squaring the second spectrum. This function, the square of the Fourier transform of the logarithm power spectrum, is called the "cepstrum," borrowing Tukey's terminology.

To prevent confusion between the usual frequency components of a time function and the "frequency" ripples in the logarithm spectrum, Tukey has used the paraphrased word quefrency in describing the "frequency" of the spectral ripples. Quefrencies have the units of cycles per hertz or, simply, seconds. Adopting this terminology, the cepstrum consists of a peak occurring at a high quefrency equal to the pitch period in seconds and low-quefrency information corresponding to the formant structure in the logarithm spectrum.

Thus far, no mention has been made about the time length of the signal under analysis. As mentioned before, for seismic signals, a single cepstrum analysis is performed for the whole seismic event. But speech parameters, and in particular pitch, change with time; therefore a series of cepstra for short time segments of the signal are required. This is accomplished by multiplying the time signal by a function that is zero outside some finite time interval. The function performs something like a window through which the time signal is viewed, and its effects are discussed later in more detail. As shown in Fig. 4, the time-limited signal is spectrum analyzed once to obtain the log spectrum and then again to produce the cepstrum. A new portion of the time signal then enters the window and is similarly analyzed to produce another cepstrum. This process, when performed repetitively, results in a series of short-time cepstra. The time window, if desired, could also look at overlapping portions of the signal.

The resultant cepstra are automatically examined to determine the maximum peaks corresponding to voiced speech intervals and the quefrency of these peaks. This information is used to decide if the speech segment is voiced or unvoiced and, if voiced, to determine the pitch period.

⁸ M. R. Schroeder, "Vocoders: Analysis and Synthesis of Speech," Proc. IEEE 54, No. 5, 720-734 (1966).

FIG. 3. Logarithm power spectrum (top) of a voiced speech segment showing a spectral periodicity resulting from the pitch periodicity of the speech. The power spectrum of the logarithm spectrum, or cepstrum (bottom), therefore has a sharp peak corresponding to this spectral periodicity. [Axes: frequency (Hz), top; quefrency (seconds), bottom.]

FIG. 4. Basic operations required for obtaining the short-time cepstrum of a speech signal. The hamming time window of length T_w sec moves in jumps of T_j sec. [Panels: speech signal with hamming time window; kth short-time spectrum; kth log power spectrum; kth short-time cepstrum.]

296 Volume 41 Number 2 1967

Both the effects of the time window and a mathematical justification for the spectral ripples were neglected in the preceding discussion and are now taken up. The time-limited signal to be analyzed is

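$$g(t) = [\,s(t) * h(t)\,]\; w(t) \tag{13}$$

from Eq. 1, where w(t) is the time window, defined to be zero for |t| > T_w. But the periodic source signal s(t) can be represented as the superposition of an infinite series of identical signals s_0(t) repeated every T seconds:

\begin{align}
s(t) &= \sum_{n=-\infty}^{\infty} s_0(t - nT) \tag{14}\\
&= s_0(t) * \sum_{n=-\infty}^{\infty} \delta(t - nT). \tag{15}
\end{align}

Substitution into Eq. 13 gives

$$g(t) = \Big\{ \Big[\, s_0(t) * \sum_{n=-\infty}^{\infty} \delta(t - nT) \Big] * h(t) \Big\}\; w(t). \tag{16}$$

The Fourier transform or complex spectrum G(ω) of g(t) is

\begin{align}
G(\omega) &= \Big[\, S_0(\omega) \sum_{n=-\infty}^{\infty} \delta\!\Big(\omega - \frac{2\pi n}{T}\Big)\, H(\omega) \Big] * W(\omega) \tag{17}\\
&= \Big[\, \sum_{n=-\infty}^{\infty} H(\omega)\, S_0(\omega)\, \delta\!\Big(\omega - \frac{2\pi n}{T}\Big) \Big] * W(\omega) \tag{18}\\
&= \Big[\, \sum_{n=-\infty}^{\infty} H\!\Big(\frac{2\pi n}{T}\Big)\, S_0\!\Big(\frac{2\pi n}{T}\Big)\, \delta\!\Big(\omega - \frac{2\pi n}{T}\Big) \Big] * W(\omega), \tag{19}
\end{align}

where S_0(ω), H(ω), and W(ω) are the Fourier transforms of s_0(t), h(t), and w(t), respectively.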

The results of the preceding show that if the original speech signal s(t) * h(t) is not time-limited, then the complex spectrum consists of an infinite series of impulses spaced T⁻¹ Hz and with amplitude H(n2π/T)·S_0(n2π/T). If the non-time-limited signal is band-limited, the complex spectrum would be frequency limited, or zero for |ω| > ω_max. The effect of time limiting the speech signal with a multiplicative time window w(t) is a convolution of the corresponding spectral window W(ω) with the spectral impulses of the non-time-limited complex spectrum. Thus, the impulses are broadened and assume the shape of W(ω). The complex spectrum is now no longer frequency-limited, since W(ω) is the transform of a time-limited function and, therefore, cannot be zero over any finite frequency interval. Hence, the complex spectrum is not strictly frequency-limited, but can be described as being approximately frequency-limited if W(ω) has very small side lobes. Also, the main lobe of W(ω) determines the spectral resolution, and therefore a W(ω) with low-amplitude side lobes and a narrow main lobe is required. Although these requirements are mutually exclusive, a good compromise is the hamming time window,

$$w(t) = \begin{cases} 0.54 + 0.46\cos(\pi t / T_w), & |t| \le T_w \\ 0, & |t| > T_w. \end{cases} \tag{20}$$

The hamming spectral window has a maximum side lobe 44 dB below its peak response.
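The chain of operations described in this section (hamming window, logarithm of the power spectrum, second spectrum, squaring) can be sketched numerically. The example below is only an illustration under stated assumptions: it uses NumPy's FFT rather than the sampled-data analyzer of the paper, a synthetic "voiced" signal made of an impulse train passed through a single decaying resonance, and hypothetical parameter values (8-msec pitch period, 700-Hz resonance, 1024-point transforms). The 10-kHz sampling rate and the 1-15-msec search interval follow the text.

```python
import numpy as np

FS = 10_000            # samples per second (10^-4 sec sampling interval)
PITCH_PERIOD = 0.008   # assumed pitch period of the synthetic talker, sec

def synth_voiced(duration, period, fs=FS):
    """Impulse train convolved with one decaying resonance (toy vocal tract)."""
    n = int(round(duration * fs))
    source = np.zeros(n)
    source[::int(round(period * fs))] = 1.0           # periodic "puffs"
    t = np.arange(int(round(0.02 * fs))) / fs
    tract = np.exp(-300.0 * t) * np.sin(2 * np.pi * 700.0 * t)  # hypothetical h(t)
    return np.convolve(source, tract)[:n]             # f = s * h, Eq. 1

def hamming(n):
    """Eq. 20 sampled at n points spanning -Tw to +Tw."""
    t_over_tw = np.linspace(-1.0, 1.0, n)
    return 0.54 + 0.46 * np.cos(np.pi * t_over_tw)

def cepstrum(segment, n_fft=1024):
    """Square of the transform of the log power spectrum of a windowed segment."""
    windowed = segment * hamming(len(segment))
    power = np.abs(np.fft.fft(windowed, n_fft)) ** 2
    log_power = np.log(power + 1e-12)                  # small floor avoids log(0)
    return np.abs(np.fft.ifft(log_power)) ** 2

speech = synth_voiced(0.1, PITCH_PERIOD)
segment = speech[200:600]                              # one 40-msec interval
c = cepstrum(segment)
quefrency = np.arange(len(c)) / FS                     # seconds
lo, hi = int(round(0.001 * FS)), int(round(0.015 * FS))  # search 1-15 msec
peak_index = lo + int(np.argmax(c[lo:hi]))
peak_quefrency = quefrency[peak_index]                 # estimated pitch period
```

For this synthetic input the largest cepstral value in the search interval falls at, or within a sample or two of, the 8-msec pitch period, while the resonance contributes only at low quefrencies.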

III. NUMERICAL COMPUTATION OF CEPSTRA

The Fourier transform F(ω) of some function of time f(t) is defined as

$$F(\omega) = \int_{-\infty}^{\infty} f(t)\, e^{-j\omega t}\, dt. \tag{21}$$

If f(t) is time limited by some multiplicative time window w(t) such that w(t) = 0 for |t| > T_w and if complex exponentiation is separated into real and imaginary parts, Eq. 21 becomes

$$F(\omega) = \int_{-T_w}^{T_w} w(t) f(t) \cos(\omega t)\, dt \;-\; j \int_{-T_w}^{T_w} w(t) f(t) \sin(\omega t)\, dt. \tag{22}$$

Furthermore, since F(ω) has a time-limited transform, namely, w(−t)f(−t), then by Nyquist's sampling theorem applied to the frequency domain, ω can be represented as ω = mΔω, where Δω



FIG. 5. Short-time logarithm spectra (left) and short-time cepstra (right) for a male talker (L.G.) recorded with a condenser microphone. The 40-msec-long hamming time window moved in jumps of 10 msec. [Axes: frequency, 0-4 kHz; quefrency, 0-15 msec; vertical scale marker, 40 dB.]

FIG. 6. Short-time logarithm spectra (left) and short-time cepstra (right) for a male talker (F.L.C.) recorded from a 500-type telephone set with carbon microphone. [Axes: frequency, 0-4 kHz; quefrency, 0-15 msec; vertical scale marker, 40 dB.]

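so that F(ω) becomes

$$F(m\Delta\omega) = \Delta t \sum_{l=-L}^{L} w(l\Delta t)\, f(l\Delta t) \cos(lm\,\Delta t\,\Delta\omega) \;-\; j\,\Delta t \sum_{l=-L}^{L} w(l\Delta t)\, f(l\Delta t) \sin(lm\,\Delta t\,\Delta\omega), \tag{23}$$

where L = T_w/Δt.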

This equation led to the concept of a delay line for storing 2L+1 samples of the input signal, sample-and-hold circuits at the taps of the delay line so that the signal being analyzed remains constant during the analysis, window multipliers, function generators for cosine and sine, and adders (as shown in Fig. 1). The real and imaginary parts of the spectrum produced by this sampled-data spectrum analyzer are squared and added to generate the power spectrum. The logarithm of the power spectrum is used as the input to a similar power-spectrum analyzer whose output is the cepstrum.

This sampled-data analyzer was simulated on an IBM-7094 digital computer by using the BLODI compiler. The input speech to the computer was band-limited to 4 kHz, sampled every 10⁻⁴ sec, and digitized; the time window extended from −15 to +15 msec. The results reported in the previous paper were obtained with this computer simulation, which consumed nearly 2 h of computer time to analyze only 2 sec of speech.

FIG. 7. Short-time logarithm spectra (left) and short-time cepstra (right) for a female talker (S.S.) recorded with a condenser microphone. [Axes: frequency, 0-4 kHz; quefrency, 0-15 msec; vertical scale marker, 40 dB.]

FIG. 8. Short-time logarithm spectra (left) and short-time cepstra (right) of "(scr)eaming," spoken by a female talker (S.S.) and recorded with a condenser microphone. A doubling in pitch period occurs at the end of the utterance. [Axes: frequency, 0-4 kHz; quefrency, 0-15 msec; vertical scale marker, 40 dB.]
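Equation 23 can be evaluated directly as a pair of weighted cosine and sine sums, which is in effect what the sampled-data analyzer does. The sketch below is illustrative only: it assumes NumPy, a hypothetical 500-Hz test tone, and a frequency step Δω = 2π/[(2L+1)Δt], which is an assumption made here because the step used in the paper is not recoverable from this text. The 10⁻⁴-sec sampling interval and the ±15-msec window follow the simulation described above.

```python
import numpy as np

def sampled_spectrum(samples, window, dt, dw, m_values):
    """Evaluate Eq. 23 at w = m*dw; samples and window hold l = -L..L."""
    L = (len(samples) - 1) // 2
    l = np.arange(-L, L + 1)
    weighted = window * samples                    # w(l dt) f(l dt)
    phase = np.outer(m_values, l) * dt * dw        # l m dt dw for every (m, l)
    real = dt * (np.cos(phase) @ weighted)         # cosine sum (real part)
    imag = -dt * (np.sin(phase) @ weighted)        # sine sum (imaginary part)
    return real, imag

def power_spectrum(real, imag):
    """Square and add the real and imaginary parts."""
    return real ** 2 + imag ** 2

dt = 1e-4                                          # sampling interval, sec
L = 150                                            # window from -15 to +15 msec
l = np.arange(-L, L + 1)
window = 0.54 + 0.46 * np.cos(np.pi * l / L)       # Eq. 20 with Tw = L*dt
signal = np.cos(2 * np.pi * 500.0 * l * dt)        # hypothetical test tone
dw = 2 * np.pi / ((2 * L + 1) * dt)                # assumed frequency step, rad/sec
m = np.arange(0, 60)

re, im = sampled_spectrum(signal, window, dt, dw, m)
power = power_spectrum(re, im)
peak_hz = m[np.argmax(power)] * dw / (2 * np.pi)   # frequency of the largest line
```

The direct sums cost on the order of (2L+1) multiplications per frequency point, which is consistent with the long running time reported for the original simulation; with the frequency step assumed here, the same values can be obtained from a fast Fourier transform of the windowed samples.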




The program was extremely unwieldy and changes in any parameters were difficult. Obviously, some streamlining of the program was required if further progress in cepstrum-pitch detection were to be accomplished.

Only a single spectrum in a series of spectra is defined by Eq. 22. If the time window moves in jumps of T_j sec, then the kth short-time spectrum F_k(m) is defined as

of log|F(ω)|² simply gives log|F(ω)|². Thus, the Fourier transform of C(τ) equals the convolution of log|F(ω)|² with itself. Since log|F(ω)|² is very small for |ω| > ω_0, the convolution is very nearly limited to the interval |ω| ≤



FIG. 9. Speech waveform of the "ing" portion of "(scr)eaming," showing the approximate location of the switch to double pitch period indicated by the cepstra. [Waveform shown in three rows, 0 to 0.1 sec, 0.1 to 0.2 sec, and 0.2 to 0.3 sec, with a 10-msec scale marker and a marker labeled "approximate beginning of double pitch period indication in cepstra."]

from a 500-type telephone set with carbon transmitter; Fig. 7 is female speech (S.S.) recorded from a condenser microphone. In all three examples, the voiced-speech intervals are clearly indicated by the sharp peaks in the cepstra. The cepstral peaks in Fig. 5 for the voiced-speech intervals of Curves 11-15 are particularly interesting since they consist of a major peak with two smaller peaks on either side. This occurred because the pitch was changing rapidly such that each 40-msec analysis interval contained different pitch periods. Actually, the 40-msec hamming window looks mostly at only the center 20 msec since the tails of the window are strongly weighted down in amplitude. Thus, very little smoothing is actually present, and the largest cepstral peak corresponds to the dominant pitch period mostly within the 20-msec center interval.

Figure 8 shows the spectra and cepstra of the utterance (scr)eaming spoken by a female talker (S.S.) into a condenser microphone. At about the 12th cepstrum, a second "rahmonic" appears and gradually grows in amplitude until, at about the 17th cepstrum, its amplitude exceeds the fundamental peak at about 5.2 msec. The fundamental peak then disappears, leaving only the cepstral peak at 10.4 msec. This would imply a doubling of pitch period at the end of the "... ing" sound, and, indeed, speech synthesized with the doubled excitation sounds natural and compares better with the original than excitation that does not double in period at the "... ing" portion. The spectra corresponding to this transition show the alternate harmonics gradually growing in amplitude until they fill in the gaps between the harmonics corresponding to the lower pitch period. The actual speech waveform is shown in Fig. 9, and the point of transition is indicated. Although the doubling is discernible towards the end of the signal, the cepstrum gives an indication of doubling earlier than would be determined by visual inspection of the waveform.

The spectra and cepstra of the word chase spoken by a female talker (B.M.) and recorded with a condenser microphone are shown in Fig. 10. The 12th, 13th, and 14th cepstra have small second rahmonics at about 8.8 msec that are smaller in amplitude than the fundamental cepstral peak at about 4.4 msec. However, the 19th through 21st cepstra have second rahmonics with amplitudes exceeding the fundamental. This type of doubling of pitch period imbedded in voiced speech sounds wrong when used as excitation for a vocoder and is therefore considered as undesirable. The spectra for the double pitch consist of harmonics corresponding to the 4.4-msec pitch period with interlaced harmonics that fade in and out across the spectrum. This type of spectrum is caused by minute jitter in the pitch-pulse timing.¹¹ If the vocal source signal s(t) is assumed to consist of air puffs at 0, T+ε, 2T, 3T+ε, ..., then

\begin{align}
s(t) &= \sum_{n=-\infty}^{\infty} \big[\, s_0(t - 2nT) + s_0(t - 2nT - T - \epsilon) \,\big] \tag{27}\\
&= s_0(t) * \sum_{n=-\infty}^{\infty} \big[\, \delta(t - 2nT) + \delta(t - 2nT - T - \epsilon) \,\big]. \tag{28}
\end{align}

The Fourier transform of the summation portion corresponding to the jittered pulses is

$$\big[\, 1 + e^{-j\omega(T+\epsilon)} \,\big]\; \frac{2\pi}{2T} \sum_{n=-\infty}^{\infty} \delta\!\Big(\omega - \frac{2\pi n}{2T}\Big). \tag{29}$$

But,

$$\big|\, 1 + e^{-j\omega(T+\epsilon)} \,\big|^2 = 2\,\big[\, 1 + \cos \omega(T+\epsilon) \,\big], \tag{30}$$

so that the power spectrum consists of impulses every 1/2T Hz with an amplitude fluctuation of [1 + cos ω(T+ε)]. If there is no jitter, then ε = 0; and, since [1 + cos ωT] = 0 for ω = (π/T)n (where n = 1, 3, 5, ...),

¹¹ B. Gold and J. Tierney, "Pitch-Induced Spectral Distortion in Channel Vocoders," J. Acoust. Soc. Am. 35, 730-731 (1963).




FIG. 10. Short-time logarithm spectra (left) and short-time cepstra (right) of "chase," spoken by a female talker (B.M.) and recorded with a condenser microphone. The 19th through 21st cepstra have second rahmonics that exceed the fundamental and that would result in an undesired indication of pitch-period doubling. [Axes: frequency, 0-4 kHz; quefrency, 0-15 msec; vertical scale marker, 40 dB.]

the odd harmonics disappear, thereby leaving impulses every 1/T Hz. However, if ε is not zero, the spectrum starts with impulses spaced 1/T Hz, but gradually

FIG. 11. Short-time logarithm spectra (left) and short-time cepstra (right) of "(o)be(y)," spoken by a male talker (R.C.L.) and recorded with a condenser microphone. The explosion occurs at the sixth spectrum and cepstrum. [Axes: frequency, 0-4 kHz; quefrency, 0-15 msec; vertical scale marker, 40 dB.]

impulses appear at 1/2T-Hz intervals and then periodically fade in and out across the spectrum. The jitter can be calculated from the frequency in the spectrum at which the amplitudes of the impulses are first equal, since at this frequency the Nth cosine wave with period 1/(T+ε) Hz has a maximum situated exactly between two adjacent impulses. For the spoken word chase, this occurred at 3 kHz corresponding to an ε of about 0.08 msec, which is smaller than the accuracy of one previous measurement of pitch perturbations.¹²

The spectra and cepstra shown in Figs. 11-13 are for a male speaker (R.C.L.) recorded with a condenser microphone. These speech utterances were chosen by O. Fujimura in his investigations at Bell Telephone Laboratories of speech sounds. The first set of spectra and cepstra show the explosion in the word obey (occurring at the sixth line of Fig. 11) as exemplified by a completely ripple-free spectrum. Figure 13 shows the spectra and cepstra for the voiced fricative portion of the word razor at the sixth through ninth lines.
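The jitter estimate quoted for the word chase can be reproduced with a short calculation. This is a reading of the argument rather than a formula stated in the paper: at the impulse frequencies ω = nπ/T the factor 1 + cos ω(T+ε) equals 1 + (−1)ⁿ cos(nπε/T), so odd and even impulses first have equal amplitude when cos(nπε/T) = 0, that is, at a frequency f = n/2T for which fε = 1/4. The function names below are illustrative.

```python
import math

def jitter_from_equal_amplitude_frequency(freq_hz):
    """Jitter (sec) implied by the frequency at which impulses first become equal."""
    return 1.0 / (4.0 * freq_hz)

def harmonic_weight(n, period, jitter):
    """Amplitude factor 1 + cos(w (T + eps)) at the nth impulse, w = n*pi/T."""
    w = n * math.pi / period
    return 1.0 + math.cos(w * (period + jitter))

# Equal amplitudes observed at 3 kHz for "chase":
epsilon = jitter_from_equal_amplitude_frequency(3000.0)   # about 8.3e-5 sec

# With the 4.4-msec pitch period of that utterance, the lowest odd impulse
# nearly vanishes while the lowest even impulse is nearly at full strength.
T = 4.4e-3
first_odd = harmonic_weight(1, T, epsilon)
first_even = harmonic_weight(2, T, epsilon)
```

The result, roughly 0.083 msec, agrees with the value of about 0.08 msec given in the text.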

V. AUTOMATIC TRACKING OF CEPSTRAL PEAKS

The cepstral peaks corresponding to voiced speech intervals can easily be picked visually. However, these

¹² P. Lieberman, "Perturbations in Vocal Pitch," J. Acoust. Soc. Am. 33, 597-603 (1961).






FIG. 12. Short-time logarithm spectra (left) and short-time cepstra (right) of "(b)abb(led)," spoken by a male talker (R.C.L.) and recorded with a condenser microphone. The explosion occurs at the sixth spectrum and cepstrum. [Axes: frequency, 0-4 kHz; quefrency, 0-15 msec; vertical scale marker, 40 dB.]

peaks must be picked automatically if cepstrum techniques are to be used in a pitch detection scheme. This section of the paper describes the heuristic development of an algorithm for picking the cepstral peak that best describes the pitch of the speech for that time interval. The criterion of "best" was evaluated by using the pitch data as excitation of a computer-simulated vocoder and then comparing the vocoded speech with the original speech.

The examples of cepstra indicate that the cepstral peaks are clearly defined and are quite sharp. Hence, the peak-picking scheme is to determine the maximum value in the cepstrum exceeding some specified threshold. Since pitch periods of less than 1 msec are not usually encountered, the interval searched for the peak in the cepstrum is 1-15 msec.

Since the cepstral peaks decrease in amplitude with increasing quefrency, a linear multiplicative weighting was applied over the 1-15-msec range. The weighting was 1 at 1 msec and 5 at 15 msec. The Fourier transform of the power spectrum of the time window equals the convolution of the time window with itself,

\begin{align}
\mathcal{F}\big[\,|W(\omega)|^2\,\big] &= \mathcal{F}\big[\,W(\omega)\,W(-\omega)\,\big] \tag{31}\\
&= w(t) * w(-t).
\end{align}
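The search interval, the linear weighting, and the threshold test described above can be sketched as follows. This is a minimal illustration, not the program used in the paper: it assumes NumPy, a cepstrum sampled at `fs` samples per second of quefrency, and a threshold value supplied by the caller (the paper does not state a numerical threshold). The function and parameter names are hypothetical.

```python
import numpy as np

def pick_cepstral_peak(cepstrum, fs, threshold,
                       lo_ms=1.0, hi_ms=15.0, w_lo=1.0, w_hi=5.0):
    """Return (quefrency in msec, weighted amplitude, voiced flag)."""
    lo = int(round(lo_ms * 1e-3 * fs))
    hi = int(round(hi_ms * 1e-3 * fs))
    q_ms = np.arange(lo, hi + 1) / fs * 1e3          # quefrencies searched, msec
    # Linear multiplicative weighting: w_lo at lo_ms rising to w_hi at hi_ms.
    weights = w_lo + (w_hi - w_lo) * (q_ms - lo_ms) / (hi_ms - lo_ms)
    weighted = cepstrum[lo:hi + 1] * weights
    k = int(np.argmax(weighted))                     # maximum weighted peak
    return q_ms[k], weighted[k], bool(weighted[k] > threshold)

# Toy cepstrum: a peak of 1.0 at 3 msec and a peak of 0.5 at 10 msec.
toy = np.zeros(200)
toy[30] = 1.0
toy[100] = 0.5
q, amp, voiced = pick_cepstral_peak(toy, fs=10_000, threshold=1.0)
```

In this toy case the weighting raises the 10-msec peak above the nominally larger 3-msec peak, which illustrates how strongly a 1-to-5 weighting favors higher quefrencies.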

FIG. 13. Short-time logarithm spectra (left) and short-time cepstra (right) of "(r)azor," spoken by a male talker (R.C.L.) and recorded with a condenser microphone. The voiced fricative occurs at the sixth through ninth spectra and cepstra. [Axes: frequency, 0-4 kHz; quefrency, 0-15 msec.]

Thus, the higher-quefrency components in the power spectrum decrease as the time window convolved with itself. Although the mathematics becomes unwieldy for an exact solution, it is reasonable to expect the higher-quefrency components in the logarithm of the power spectrum to decrease similarly, thereby explaining the need of weighting of the higher quefrencies in the cepstrum. The linear weighting with range of 1-5 was chosen empirically by using periodic pulse trains with varying periods as input to the cepstrum program.

The cepstral peaks at the end of a voiced-speech segment usually decrease in amplitude and would fall below the peak threshold. The solution is to decrease the threshold by some factor (2) over a quefrency range of ±1 msec of the immediately preceding pitch period when tracking the pitch in a series of voiced-speech segments. The threshold reverts to its normal value over the whole cepstrum range after the end of the series of voiced segments.

There is also the possibility that an isolated cepstral peak might exceed the threshold, thereby resulting in a false indication of a voiced speech segment. In fact, some isolated flaps of the vocal cords have been observed as the cause of such an isolated cepstral peak. In any event, such peaks should not be considered as





FIG. 14. Flow chart of the algorithm used to decide if the Nth cepstrum represents a voiced speech interval. [Boxes legible in the source: linear weighting of cepstrum; pick maximum peak of weighted cepstrum; investigate pitch doubling; pick maximum peak in an interval of ±0.5 msec; test whether the peak is within ±1.0 msec of the quefrency of the peak of the Nth cepstrum; set threshold to one-half its initial value or to its initial value; store amplitude and quefrency of maximum peak; set quefrency of pitch peak at Nth cepstrum equal to the average pitch quefrency of the (N−1)st and (N+1)st cepstra; voiced or unvoiced at Nth cepstrum.]



voiced, and this is accomplished by disregarding any cepstral peaks exceeding the threshold if the immediately preceding cepstrum and immediately following cepstrum indicate unvoiced speech. This means that the immediately following cepstrum must be peak searched before a decision can be made about the present cepstrum. Hence, a delay of one cepstrum must be introduced to eliminate this requirement of knowledge about the future. Before deciding about the "present" cepstrum, however, knowledge about the preceding and following cepstrum is also required for the algorithm used to eliminate another problem, namely, pitch doubling.

An example of legitimate pitch doubling occurred at the end of the word screaming, as shown in Fig. 8. However, the second rahmonic of a cepstral peak sometimes exceeds the fundamental, and the second rahmonic should not be chosen as representing the pitch period. Thus, the peak-picking algorithm should eliminate false pitch doubling caused by a second rahmonic but should also allow legitimate pitch doubling. For legitimate doubling, there is no cepstral peak at one-half the quefrency, but for erroneous doubling, there is such a peak at one-half the quefrency since this is the fundamental. The algorithm capitalizes upon this observation by looking for a cepstral peak exceeding the threshold in an interval of ±0.5 msec of one-half the quefrency of the double-pitch peak. If such a peak is found, then it is assumed that it represents the fundamental, and the double-pitch indication is wrong. The threshold is reduced by a factor of 2 if the maximum peak in the ±0.5-msec interval falls within ±1.0 msec of the immediately preceding pitch period. Pitch doubling has occurred whenever the cepstral peak exceeding the threshold is at a quefrency of ≥1.6 times the immediately preceding pitch period.

A flow chart of the peak-picking algorithm is shown in Fig. 14. The algorithm determines whether the cepstral peak of the Nth cepstrum represents a voiced speech segment. Information about the (N−1)th cepstrum is stored, and the (N+1)th cepstrum is peak picked before deciding about the Nth cepstrum. The (N+1)th cepstrum is read in, linear weighting is applied, and the maximum peak is picked. If the preceding two cepstra represented voiced-speech segments, then pitch tracking is in effect, and the threshold is reduced to one-half its initial value if the quefrency of the peak is within ±1.0 msec of the quefrency of the pitch peak of the Nth cepstrum. The previously determined peak in the (N+1)th cepstrum is now compared with the threshold. Pitch doubling is investigated whether the peak exceeds or does not exceed the threshold. Both cases are checked since the peak might represent pitch doubling and yet not exceed the initial value of the threshold. But, the fundamental peak could still exceed the one-half initial value threshold. If the maximum peak exceeds the threshold, it is tentatively chosen as a pitch peak representing a voiced-speech segment at the (N+1)th cepstrum. The information about the (N+1)th cepstrum and (N−1)th cepstrum is then used to decide if the Nth cepstral peak represents an isolated voiced segment or an isolated absence of voicing in a series of voiced-speech segments. The final result is an indication of whether the Nth cepstrum represents a voiced or an unvoiced speech segment. If the segment is voiced, the pitch period is also given.

A computer program was written to perform the operations required by the algorithm. The voicing and pitch-period information were both printed on paper and written on magnetic tapes for later processing.

FIG. 15. Method for deriving pitch pulses from pitch period data supplied by cepstral peak picker. [Traces: pitch period, constant over 10-msec steps; smoothed pitch period; counter ramp.]

VI. VOCODER EXCITATION

The final judge of any vocal-pitch detection scheme is its ability to perform satisfactorily in determining the excitation for a vocoder. Vocoder excitation in the form of pitch pulses during voicing and white noise during nonvoicing thus had to be derived from the results of the cepstral peak picking.

The cepstral peak picker produced two outputs on digital magnetic tape. The first tape contained voicing information as two dc levels corresponding to a voiced or unvoiced speech interval. The levels were constant for the 10 msec corresponding to the speech time jumps. The second tape contained the pitch period as dc level signals that also were constant for 10 msec. These two tapes formed the input to the excitation generator.

The voicing and pitch-period signals are first each smoothed by a pair of 33-Hz low-pass filters. The pitch pulses are derived from the smoothed pitch signal as shown in Fig. 15 by running a counter up until it equals the smoothed pitch signal. An impulse is then emitted, and the counter is reset to zero before again starting its count. If the smoothed pitch signal is measured in tenths of a millisecond and the counter counts in tenths of a millisecond, then the timing between the emitted



FIG. 16. Block diagram of 13-spectrum channel vocoder with excitation derived from a cepstrum pitch detector. [Blocks legible in the source: speech input; analyzer, with short-time cepstrum analyzer, cepstral peak picker, 34-Hz low-pass filters for the pitch and voicing signals, and band-pass filters; transmission; synthesizer, with random noise generator, pitch-pulse generator, adder, modulators, and band-pass filters; reconstructed speech output.]

impulses equals the pitch period. The smoothed voicing signal is used to control a double-throw switch for choosing either pitch pulses or white noise as a final excitation output.

This technique was devised and simulated on the computer by M. M. Sondhi using the BLODI programming language. The output of the program was still another digital magnetic tape, which was then used as the excitation input to a 13-spectrum channel vocoder designed by Golden.¹³ The vocoder was also simulated on the computer using the BLODI programming language. The spectrum channel information was derived from a computer-simulated vocoder analyzer and, together with the excitation, formed the input to the computer-simulated synthesizer. The whole operation from speech signal to simulated vocoder output is shown in Fig. 16. The digital computer generated numerous visual outputs on microfilm including the short-time spectra and cepstra, the voicing and pitch-period variations, the original speech signal, and the vocoded speech signal. These visual outputs were extremely valuable in devising the final versions of all the different portions of the chain making up the complete pitch-detection scheme.

The complete scheme, including the vocoder, was used to modify and improve all portions of the chain by comparing the vocoded speech with the original speech. In particular, the pitch-period doubling at the end of the word screaming was determined to be aurally correct by such a comparison of original with vocoded speech.

The synthetic speech from the computer-simulated cepstrum-excitation channel vocoder was compared both with the original speech and with the synthetic speech from a computer-simulated voice-excited vocoder and the same computer-simulated channel vocoder, but with the full-band speech as excitation. Although only a few sentences spoken by four talkers were used in these informal paired-comparison tests, the

13 R. M. Golden, "Digital Computer Simulation of a Sampled-Data Voice-Excited Vocoder," J. Acoust. Soc. Am. 35, 1358-1366 (1963).

306 Volume 41 Number 1967

pitch quality of the channel vocoder with cepstrum pitch detection was judged to be excellent by experienced vocoder critics. This optimism was sufficient to initiate construction of a real-time cepstrum pitch detector.14

VII. IMPLEMENTATION OF CEPSTRUM ANALYZERS

In its most basic form, cepstrum-pitch detection requires two spectrum analyses with logic circuitry for picking the cepstral peak corresponding to the pitch period of a voiced-speech segment. Thus, a means for performing two spectrum analyses in real time is required for a hardware implementation of a cepstrum-pitch detector. The requirements of real-time operation and good frequency resolution in the spectrum analyzers are somewhat difficult to satisfy and have therefore resulted in the correct opinion that a hardware cepstrum analyzer would be difficult to construct.

However, techniques are available for performing real-time spectrum analyses that could be adapted to cepstrum analysis. One such method performs the spectrum analysis by a circulating delay line with a time-variable phase shifter operating upon a heterodyned version of the time signal. This method, described by Bickel and Bernstein,15 has been successfully used by Weiss, Vogel, and Harris in an implementation of a cepstrum analyzer.16,17 Still another method, similar to a spectrum analyzer described by Gill, uses a heterodyne filter operating on a time-swept version of the input signal.18 Kelly and Kennedy have utilized this

14 . M. Kelly and R. N. Kennedy, "An Experimental Cepstrum Pitch Detector for Use in a 2400-bit/sec Channel Vocoder," presented at the 72nd meeting of the Acoustical Society of America (Nov. 1966), Paper 1H3.
15 H. J. Bickel and R. I. Bernstein, U.S. Patent No. 3,013,209.
16 H. J. Bickel, "Spectrum Analysis with Delay-Line Filter," IRE WESCON Conv. Rec. 1959 (Part 8), 59-67 (1959).
17 M. R. Weiss, R. P. Vogel, and C. M. Harris, "Implementation of a Pitch Extractor of the Double-Spectrum-Analysis Type," J. Acoust. Soc. Am. 40, 657-662 (1966).
18 . S. Gill, "A Versatile Method for Short-Term Spectrum Analysis in 'Real-Time,'" Nature 189, No. 4759, 117-119 (14 Jan. 1961).




method in yet another successful implementation (also including logic circuitry to track the cepstral peak).14 They have also derived vocoder excitation from their cepstra and have produced excellent-quality vocoded speech utilizing a complete hardware system of cepstrum analyzer and vocoder.

Both methods utilize analog-circuit techniques during all or part of the spectrum analysis. Digital techniques, however, have progressed to the state where a completely digital implementation should be possible. The Cooley-Tukey algorithm greatly reduces the number of multiplications and additions, and might be of practical use in such a completely digital cepstrum analyzer.

Another promising method utilizes the spectrum analyzing properties of a lens.19,20 A lens forms at its focal plane an image that is the Fourier transform of the image at the object plane. Since this is a spatial Fourier transform, the signal must be frozen in time with light intensity made proportional to signal amplitude. A coherent light source is required to illuminate the spatial representation of the signal, and there are some questions concerning the most efficient way to convert the time signal into such a spatial signal. But, the technique seems particularly promising (since parallel processing is very convenient), so that thousands of signals could be analyzed almost simultaneously.

VIII. PSEUDO-AUTOCOVARIANCE OR CEPSTRUM?

In their article in Rosenblatt's book, Bogert et al.3 define the cepstrum as "autocovariance and Fourier transformation ... [of] the log spectrum of the original process." Since the Fourier transform of the autocovariance of some function is identical with the power spectrum of the same function, the cepstrum should be equivalent to the power spectrum of the log power spectrum of the original process. Furthermore, since the log power spectrum is an even function of frequency, the cepstrum should equal the square of the cosine transform of the log power spectrum.

Later in the article, Bogert et al. define a pseudo-autocovariance as "the Fourier transform of [the log] power spectrum."
The "pseudo" prefix is logically used since the Fourier transform of the non-logged power spectrum is the usual autocovariance. Thus, the cepstrum should equal the square of the pseudo-autocovariance. But, in their definition of the cepstrum, Bogert et al. had meant to assume that the log spectrum existed for all positive frequencies (private communication). As a result, their cepstrum equals the sum of the squares of the sine transform and the cosine transform of the log power spectrum. Stated mathematically, their definition of the cepstrum is

$$C_{\mathrm{Bogert}}(\tau) = \{\mathcal{F}_{\sin}[\log |F(\omega)|^2]\}^2 + \{\mathcal{F}_{\cos}[\log |F(\omega)|^2]\}^2, \qquad (32)$$

where $\mathcal{F}_{\sin}$ and $\mathcal{F}_{\cos}$ denote Fourier sine transformation and Fourier cosine transformation, respectively; $F(\omega)$ is the complex Fourier transform of the original process; and $F(\omega) = 0$ for $\omega$

19 L. J. Cutrona, E. N. Leith, C. J. Palermo, and L. J. Porcello, "Optical Data Processing and Filtering Systems," IRE Trans. Information Theory IT-6, 386-400 (1960).
20 B. Julesz, A. M. Noll, and M. R. Schroeder, "Optical Cepstrum Analysis" (unpublished memorandum).



[Figure 17: four panels. Panel (a): horizontal axis FREQUENCY (kHz), 0 to 4, with a 40-decibel scale marker. Panels (b), (c), and (d): horizontal axis QUEFRENCY (msec), 0 to 15.]

Fig. 17. (a) Short-time logarithm spectra, (b) pseudo-autocovariance squared, (c) pseudo-quadrature autocovariance squared, and (d) Bogert's cepstra (defined as the sum of the squares of the cosine and sine transforms of the logarithm spectra) for a male talker (F.L.C.) recorded from a 500-type telephone handset and with additive white noise (signal-to-noise ratio 12 dB).
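
The relation among panels (b), (c), and (d) of Fig. 17 can be written out as a short worked identity. This is only a sketch under stated assumptions: normalization constants are omitted, the symbols $L$, $P_c$, and $P_s$ are introduced here for illustration, and identifying panel (c) with the sine transform is inferred from the caption's description of panel (d), not stated explicitly in the text.

```latex
% L(\omega): log power spectrum, taken to exist only for \omega \ge 0
L(\omega) = \log |F(\omega)|^2

% Panel (b): pseudo-autocovariance (cosine transform of L)
P_c(\tau) = \int_0^{\infty} L(\omega)\,\cos(\omega\tau)\,d\omega

% Panel (c): pseudo-quadrature autocovariance (sine transform of L, assumed)
P_s(\tau) = \int_0^{\infty} L(\omega)\,\sin(\omega\tau)\,d\omega

% Panel (d): Bogert's cepstrum as in Eq. (32), the sum of the two squares,
% which equals the squared magnitude of a one-sided complex transform
C_{\mathrm{Bogert}}(\tau) = P_c^2(\tau) + P_s^2(\tau)
  = \left| \int_0^{\infty} L(\omega)\, e^{j\omega\tau}\, d\omega \right|^2
```

The last step follows from $e^{j\omega\tau} = \cos(\omega\tau) + j\sin(\omega\tau)$ with $L(\omega)$ real, so panel (b) alone corresponds to the cepstrum defined as the square of the cosine transform, while panel (d) adds the quadrature term.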

Of course, cepstrum pitch detection is insensitive to narrow-band white noise, since such noise would at most obscure only a few spectral ripples.

Cepstrum pitch detection has to some extent changed our over-all concept of a vocoder. Previously, most diagrams of a channel vocoder showed considerable detail about the channel filters while the pitch detector was usually shown as a small block at the bottom, although the pitch detector itself was sometimes quite elaborate. However, the spectrum-channel information is obtained as an intermediate step during the cepstrum-analysis process. Thus, our new concept of a vocoder analyzer shows an involved diagram of a pitch detector with the spectrum-channel information obtained as a by-product! (See Fig. 18.) Perhaps this is more realistic, because it has long been recognized that accurate pitch information is the most challenging aspect of vocoder design. The spectrum-channel information has perhaps been reduced to its true relative importance.

But where does all this effort lead us? It seems that vocoder design is becoming conceptually more complicated with asymptotic, though not necessarily insignificant, improvements in quality. The vocoder schemes and pitch detectors are becoming increasingly exotic, as exemplified by cepstrum pitch detection. Also, such new speech transmission methods as microwave, satellites, and the promise of light communication over laser beams might someday change the present restrictions on available bandwidth. The future of vocoders for speech bandwidth compression might seem bleak. Why





Fig. 18. New concept of spectrum-channel vocoder in which the spectrum channel information is obtained as a by-product of the cepstrum pitch detector.

[Figure 18: block diagram. Recoverable labels: PITCH DETECTOR; SPEECH INPUT; SPECTRUM ANALYZER (two blocks); CEPSTRAL PEAK PICKER AND PITCH TRACKING ALGORITHM; PITCH PERIODS; VOICING; SPECTRUM CHANNEL INFORMATION.]

continue, then, with vocoder development, and, in particular, why be concerned with pitch detectors?

Special-purpose vocoders can be useful in removing certain types of speech distortion. For example, the "Donald Duck" quality of speech spoken in the helium environment used in certain underwater quarters such as Sealab can be eliminated by frequency shifting of the vocoder channel signals.21 The transmission of speech can be made private or secure by the use of vocoders. An accurate pitch-detection scheme would become a

21 R. M. Golden, "Improving Naturalness and Intelligibility of Helium-Oxygen Speech Using Vocoder Techniques," J. Acoust. Soc. Am. 40, 621-624 (1966).

very important tool in speech research by fostering research in pitch fluctuations and patterns. Thus, further research and development of pitch detectors is warranted not only to produce speech bandwidth-compression vocoders but also as a fundamental tool for speech research and for special-purpose vocoders.

As mentioned previously, cepstrum analysis performs remarkably well as a vocal-pitch detector. However, a more general conclusion has evolved from the concept of cepstrum analysis: that the spectrum itself can be regarded as a signal and can be processed by standard signal-analysis techniques. With such a viewpoint, cepstrum analysis and other signal processing of the spectrum do not seem quite so exotic.
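
The idea that the log spectrum can itself be spectrum-analyzed, and the completely digital implementation suggested in Sec. VII, can be illustrated with a minimal Python sketch. It is not the program used in this work. The function names `cepstrum_pitch` and `synthetic_voiced`, the 10-kHz sampling rate, the 40-msec Hamming-weighted frame, the 3- to 15-msec search range, and the single 700-Hz resonance in the test signal are all assumptions chosen for illustration, and NumPy's FFT routines stand in for a Cooley-Tukey implementation.

```python
import numpy as np


def cepstrum_pitch(frame, fs, min_period=0.003, max_period=0.015):
    """Estimate the pitch period (seconds) of one voiced frame.

    Returns (period_seconds, cepstrum). The search range is an
    illustrative assumption, not a value taken from the paper.
    """
    n = len(frame)
    windowed = frame * np.hamming(n)

    # First spectrum analysis: log power spectrum of the frame.
    # The small constant only guards against log(0).
    spectrum = np.fft.rfft(windowed)
    log_power = np.log(np.abs(spectrum) ** 2 + 1e-12)

    # Second spectrum analysis: the log power spectrum is treated as
    # real and even in frequency, so the inverse real FFT is its cosine
    # transform; squaring gives the power spectrum of the log spectrum.
    cepstrum = np.fft.irfft(log_power, n) ** 2

    # Pick the largest cepstral peak inside the allowed quefrency range.
    lo = int(round(min_period * fs))
    hi = min(int(round(max_period * fs)), n // 2)
    peak = lo + int(np.argmax(cepstrum[lo:hi + 1]))
    return peak / fs, cepstrum


def synthetic_voiced(fs, period_s, n_samples):
    """Hypothetical test signal: impulse train through one resonance."""
    period = int(round(period_s * fs))
    pulses = np.zeros(n_samples)
    pulses[::period] = 1.0
    t = np.arange(200) / fs
    # Decaying 700-Hz sinusoid as a crude stand-in for a formant.
    impulse_response = np.exp(-300.0 * t) * np.sin(2 * np.pi * 700.0 * t)
    return np.convolve(pulses, impulse_response)[:n_samples]


if __name__ == "__main__":
    fs = 10000
    signal = synthetic_voiced(fs, 0.008, 800)
    # Analyze a 40-msec frame taken after the start-up transient.
    period, _ = cepstrum_pitch(signal[400:800], fs)
    print("estimated pitch period (msec):", 1000 * period)
```

The sketch omits the voiced-unvoiced decision and the pitch-tracking logic; it only locates the cepstral peak in a single frame.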

