Mel$frequency)cepstral)) coefficients)(MFCCs)...

16
SGN14006 Audio and Speech Processing Melfrequency cepstral coefficients (MFCCs) and gammatone filter banks Slides for this lecture are based on those created by Katariina Mahkonen for TUT course ”PuheenkäsiMelyn menetelmät” in Spring 2013. Pasi PerQlä SGN14006 2015

Transcript of Mel$frequency)cepstral)) coefficients)(MFCCs)...

Page 1: Mel$frequency)cepstral)) coefficients)(MFCCs) …sgn14006/PDF2015/S04-MFCC.pdfCalculaon)of)MFCC)coefficients ) – Define)triangular)”bandpass)filters”)uniformly)distributed)

SGN-­‐14006  Audio  and  Speech  Processing  

 Mel-­‐frequency  cepstral    coefficients  (MFCCs)  

and  gammatone  filter  banks  Slides  for  this  lecture  are  based  on  those  created  by  Katariina  Mahkonen  for  TUT  course  ”PuheenkäsiMelyn  menetelmät”  in  Spring  2013.    

 

Pasi  PerQlä  SGN-­‐14006  2015  

Page 2: Mel$frequency)cepstral)) coefficients)(MFCCs) …sgn14006/PDF2015/S04-MFCC.pdfCalculaon)of)MFCC)coefficients ) – Define)triangular)”bandpass)filters”)uniformly)distributed)

IntroducQon  

•  MFCC  coefficients  model  the  spectral  energy  distribuQon  in  a  perceptually  meaningful  way  

•  MFCCs  are  the  most  widely-­‐used  acousQc  feature  for  speech  recogniQon,  speaker  recogniQon,  and  audio  classificaQon  

•  MFCCs  take  into  account  certain  properQes  of  the  human  auditory  system  –  CriQcal-­‐band  frequency  resoluQon  (approximately)  –  Log-­‐power  (dB  magnitudes)  

Page 3: Mel$frequency)cepstral)) coefficients)(MFCCs) …sgn14006/PDF2015/S04-MFCC.pdfCalculaon)of)MFCC)coefficients ) – Define)triangular)”bandpass)filters”)uniformly)distributed)

Spectrogram  of  piano  notes  C1  –  C8  

Note  that  the  fundamental  frequency  16,32,65,131,261,523,1045,2093,4186  Hz  doubles  in  each  octave  and  the  spacing  between  harmonic  parQals  doubles  too.  -­‐  Such  octave  change  is  perceived  as  ”doubling  the  height  of  the  note”  

f0  

f0  

f0  

Page 4: Mel$frequency)cepstral)) coefficients)(MFCCs) …sgn14006/PDF2015/S04-MFCC.pdfCalculaon)of)MFCC)coefficients ) – Define)triangular)”bandpass)filters”)uniformly)distributed)

Mel  scale  •  Mel-­‐frequency  scale  represents  subjecQve  (perceived)  pitch.  It  is  one  of  the  

perceptually  moQvated  frequency  scales  (see  figure  below).    –  Mel-­‐scale  is  constructed  using  pairwise  comparisons  of  sinusoidal  tones:  a  reference  

frequency  is  fixed  and  then  a  test  subject  (human  listener)  is  asked  to  adjust  the  frequency  of  the  other  tone  to  be  two  Qmes  higher  or  half  Qmes  lower.  

–  Models  the  non-­‐linear  percepQon  of  frequencies  in  the  human  auditory  system  •  For  comparison,  the  Bark  criQcal-­‐band  scale  has  been  constructed  based  on  

the  masking  properQes  of  nearby  frequency  components.  –  Constructed  by  filling  the  audible  bandwidth  with  adjacent  criQcal  bands  1…26  

•  Note  that  all  the  scales  are  related  and:    fMel  ≈  100fBark      (very  roughly)  

mm. on basilar membrane frequency / kHz frequency / mel frequency / Bark

Page 5: Mel$frequency)cepstral)) coefficients)(MFCCs) …sgn14006/PDF2015/S04-MFCC.pdfCalculaon)of)MFCC)coefficients ) – Define)triangular)”bandpass)filters”)uniformly)distributed)

Mel  scale  

     The  anchor  point  for  Mel  scale  is  chosen  so  that  1000  Hz  =  1000  Mel    

)700

1(log2595 10Hz

Melff +=

Page 6: Mel$frequency)cepstral)) coefficients)(MFCCs) …sgn14006/PDF2015/S04-MFCC.pdfCalculaon)of)MFCC)coefficients ) – Define)triangular)”bandpass)filters”)uniformly)distributed)

Piano  tones  C1  –  C5      

Mel-­‐frequency  spectrogram  

       

and          

Bark-­‐scale  spectrogram  

   

Page 7: Mel$frequency)cepstral)) coefficients)(MFCCs) …sgn14006/PDF2015/S04-MFCC.pdfCalculaon)of)MFCC)coefficients ) – Define)triangular)”bandpass)filters”)uniformly)distributed)

•  Weber  rule  says  that  the  perceived  change  in  a  physical  quanQty  is  proporQonal  to  the  relaQve  change:  

 

•  Therefore  it  makes  sense  to  measure  sound  levels  in  decibels:  LI = 10log10(I)

ProperQes  of  human  hearing  –percepQon  of  loudness  differences  

Page 8: Mel$frequency)cepstral)) coefficients)(MFCCs) …sgn14006/PDF2015/S04-MFCC.pdfCalculaon)of)MFCC)coefficients ) – Define)triangular)”bandpass)filters”)uniformly)distributed)

Now  let’s  get  back  to  the  calculaQon  of  MFCC  coefficients…  The  most  widely-­‐used  acousQc  feature  used  to  represent  a  speech  frame  (in  

speech  recogniQon  for  example)  

Page 9: Mel$frequency)cepstral)) coefficients)(MFCCs) …sgn14006/PDF2015/S04-MFCC.pdfCalculaon)of)MFCC)coefficients ) – Define)triangular)”bandpass)filters”)uniformly)distributed)

CalculaQon  of  MFCC  coefficients  –  Define  triangular  ”bandpass  filters”  uniformly  distributed  on  the  Mel  scale  (usually  about  40  filters  in  range  0…8kHz).  

•  Note  that  the  Mel  filter  bank  has  overlap  between  adjacent  frequency  bands.              The  center  (Mel  scale)  frequency  of  band  n  is  fMel,c(n)  •  Mel  filter  of  band  n  starts  at  0  amplitude  at  fMel,c  (n-­‐1)  •  has  maximum  amplitude  at  fMel,c(n)  and  decays  to  zero  at  fMel,c  (n+1)  

Linear  spacing  in  Mel  scale  

Page 10: Mel$frequency)cepstral)) coefficients)(MFCCs) …sgn14006/PDF2015/S04-MFCC.pdfCalculaon)of)MFCC)coefficients ) – Define)triangular)”bandpass)filters”)uniformly)distributed)

CalculaQon  of  MFCC  coefficients  –  Pre-­‐emphasize  the  signal,  i.e.,  filter  with  H(z)=1-az-1,

0.95<a<0.99 –  The  signal  is  processed  in  short  windows  of  x(n).    – Window  the  short  signal  x(n)  with  a  window  funcQon  w(n)  –  take  DFT  of  x(n)  -­‐>  X(f)  –  Obtain  MFCC  –  proceed  to  next  window  

Page 11: Mel$frequency)cepstral)) coefficients)(MFCCs) …sgn14006/PDF2015/S04-MFCC.pdfCalculaon)of)MFCC)coefficients ) – Define)triangular)”bandpass)filters”)uniformly)distributed)

CalculaQon  of  MFCC  coefficients  –  Define  triangular  ”bandpass  filters”  Wk, k=1,…,K  uniformly  distributed  on  the  Mel  scale  (usually  K=40 filters  in  range  0…8kHz).  

–  DFT  bin  energies  |X(f)|2  of  each  filter    are  weighted  with  kth  band’s  filter  shape  Wk(f)  and  accumulated.  

–  Take  logarithm  of  each  E(k), k=1,2,…K –  Calculate  discrete  cosine  transform  (DCT  II)  of  log  energies      à  cn  are  called  MFCCs

E(k) = Wk ( f ) X( f )2

f∑

cn = log(E(k))cos n k − 12

"

#$

%

&'πM

(

)*

+

,-k=1

K∑ , for n =1,...,K

Page 12: Mel$frequency)cepstral)) coefficients)(MFCCs) …sgn14006/PDF2015/S04-MFCC.pdfCalculaon)of)MFCC)coefficients ) – Define)triangular)”bandpass)filters”)uniformly)distributed)

Log-­‐Mel  energies  example    

•  Time  domain:  take  one  window  of  data  x(n)  •  Use  (pre-­‐emphasis  and)  windowing  •  Mel  scale  coefficients  in  matrix  W30,512  

•  MulQply  W  with  |X(f)|2    and  take  logarithm  

x   =  

30x512   512x1   30x1  

x  

30x512   512x121  30x121  

W  

W  

Qme  domain  signal   spectrogram  of  signal  

=  

Page 13: Mel$frequency)cepstral)) coefficients)(MFCCs) …sgn14006/PDF2015/S04-MFCC.pdfCalculaon)of)MFCC)coefficients ) – Define)triangular)”bandpass)filters”)uniformly)distributed)

MFCCs  from  Log-­‐Mel  energies  example    

•  Apply  DCT  to  log  Mel  energy  spectrum  of  each  frame  

DCT-­‐II  

Page 14: Mel$frequency)cepstral)) coefficients)(MFCCs) …sgn14006/PDF2015/S04-MFCC.pdfCalculaon)of)MFCC)coefficients ) – Define)triangular)”bandpass)filters”)uniformly)distributed)

Why  are  MFCC  coefficients  successful  in  audio  classificaQon?  

–  Perceptually-­‐moQvated  (near  log-­‐f)  frequency  resoluQon  –  Perceptually-­‐moQvated  decibel-­‐magnitude  scale  –  Discrete  cosine  transform  decorrelates  the  features  (improves  staQsQcal  properQes  by  removing  correlaQons  between  the  features)  

–  Convenient  control  of  the  model  order:  picking  only  the  lowest  N  coefficients  gives  lower-­‐resoluQon  approximaQon  of  the  spectral  energy  distribuQon  (vocal  tract  etc.)  

Page 15: Mel$frequency)cepstral)) coefficients)(MFCCs) …sgn14006/PDF2015/S04-MFCC.pdfCalculaon)of)MFCC)coefficients ) – Define)triangular)”bandpass)filters”)uniformly)distributed)

Gammatone  filter  bank  •  Gammatone  filter  bank  emulates  human  hearing  by  simulaQng  the  impulse  response  of  the  auditory  nerve  fiber.  

•  Shape  resembles  a  tone  modulated  with  a  gamma-­‐funcQon.    

•  a is  peak  value,  tn-1  Qme  onset,  exp()  –term  defines  bandwidth  and  decay,  fc  is  characterisQc  frequency,  and  φ  is  iniQal  phase.    

•  Typically  42  bands,  from  30Hz  to  18kHz  •  Drawback:  Does  not  emulate  level-­‐dependent  characterisQcs  of  auditory  filters.    

g(t) = atn−1e−2πb( fc )t cos(2π fct +φ)

Page 16: Mel$frequency)cepstral)) coefficients)(MFCCs) …sgn14006/PDF2015/S04-MFCC.pdfCalculaon)of)MFCC)coefficients ) – Define)triangular)”bandpass)filters”)uniformly)distributed)

Example:  Gammatone  filter  bank  

a)  response  shape  (Qme  domain)  b)  magnitude  responses            of  40  filters  on  ERB  scale  c)  log  output  of  30  filters  d)  convenQonal          spectrogram    

a)  

b)  

c)  d)  

hMp://lqat.sourceforge.net/