The L2F Spoken Web Search system for Mediaeval 2012


Alberto Abad and Ramón F. Astudillo

L2F - Spoken Language Systems Lab

INESC-ID Lisboa, Portugal [email protected]‐id.pt

Mediaeval 2012 Workshop, Pisa, October 4, 2012



Introduction

In this first L2F/INESC-ID participation in SWS, our main objectives were:
•  To learn
•  To have fun
•  To build a reasonable system (in an unreasonably short amount of time)

The submitted L2F SWS system exploits hybrid ANN/HMM connectionist methods for both query tokenization and acoustic keyword search
•  Composed of the fusion of 4 phonetic-based SWS sub-systems
   o  Based on our in-house ASR system named AUDIMUS
•  Each sub-system uses different language-dependent acoustic models:
   o  European Portuguese (pt), Brazilian Portuguese (br), European Spanish (es) and American English (en)
•  Different detection score normalization and fusion methods investigated
   o  Submitted system applies per-query score normalization (Q-norm) and majority voting (MV) fusion


The baseline speech recognizer

AUDIMUS is our in-house hybrid HMM/MLP speech recognizer
•  Feature extraction: Multi-stream 26 PLP, 26 logRASTA-PLP, 28 MSG and 39 ETSI (only 8 kHz)
•  MLP: Several context input frames (13-15), 2 hidden layers (500 units) and 1 output layer
•  Output layer size: 39 for pt, 40 for br, 30 for es, 41 for en
•  Data: pt 115 hours (57 BN + 58 telephone); br 13 hours of BN data; es 57 hours (36 BN + 21 telephone); en 142 hours of BN data (HUB-4 96 & 97)
•  HMM topology: Single-state phonemes (3 frames minimum duration)
•  Decoder: Uses Weighted Finite-State Transducer (WFST) approach


Spoken Query Tokenization

For each sub-system, obtain a phonetic tokenization of the queries
•  Similar to LID parallel phonotactic approaches

Use a phone-loop grammar with phoneme minimum duration of 3 frames to obtain the 1-best phoneme chain
•  Alternative n-best hypotheses for characterizing each query were explored with unsatisfactory results
   o  Other possibilities not explored (yet): lattice, CN, etc.
•  Influence of the word insertion penalty (wip) parameter on the tokenization result
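As a rough illustration of phone-loop decoding with a 3-frame minimum duration, the sketch below runs a plain Viterbi search over per-frame phone log-likelihoods. It is not the AUDIMUS WFST decoder: the function name `decode_phone_loop`, the input layout (one list of per-phone log-likelihoods per frame) and the treatment of `wip` as a log-domain penalty added at every phone entry are assumptions made for illustration.

```python
def decode_phone_loop(loglik, min_dur=3, wip=0.0):
    """Return the 1-best phone index chain under a minimum phone duration.

    loglik[t][p] is the (hypothetical) log-likelihood of phone p at frame t.
    """
    n_frames, n_phones = len(loglik), len(loglik[0])
    neg_inf = float("-inf")
    last = min_dur - 1  # sub-state meaning "minimum duration reached"

    # score[p][d]: best log-score of a path ending in phone p, sub-state d
    score = [[neg_inf] * min_dur for _ in range(n_phones)]
    for p in range(n_phones):
        score[p][0] = loglik[0][p] + wip
    backptr = []  # backptr[t-1][p][d] = (prev_p, prev_d, entered_new_phone)

    for t in range(1, n_frames):
        # Best phone that is allowed to end at the previous frame
        exit_p = max(range(n_phones), key=lambda p: score[p][last])
        exit_score = score[exit_p][last]
        new_score = [[neg_inf] * min_dur for _ in range(n_phones)]
        new_back = [[None] * min_dur for _ in range(n_phones)]
        for p in range(n_phones):
            for d in range(min_dur):
                cands = []
                if d == 0:  # enter a new phone, paying the insertion penalty
                    cands.append((exit_score + wip, (exit_p, last, True)))
                if d > 0:  # advance inside the same phone
                    cands.append((score[p][d - 1], (p, d - 1, False)))
                if d == last:  # stay in the phone once min duration is met
                    cands.append((score[p][d], (p, d, False)))
                best, ptr = max(cands, key=lambda c: c[0])
                if best > neg_inf:
                    new_score[p][d] = best + loglik[t][p]
                    new_back[p][d] = ptr
        score = new_score
        backptr.append(new_back)

    end_p = max(range(n_phones), key=lambda p: score[p][last])
    if score[end_p][last] == neg_inf:
        return []  # fewer frames than one minimum-duration phone
    chain, p, d = [end_p], end_p, last
    for t in range(n_frames - 1, 0, -1):
        p_prev, d_prev, entered = backptr[t - 1][p][d]
        if entered:
            chain.append(p_prev)
        p, d = p_prev, d_prev
    chain.reverse()
    return chain
```

In this sketch a more negative `wip` favours fewer, longer phones, which is one way the parameter can change the resulting tokenization.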


Spoken Query Search

Spoken query search based on AKWS with our hybrid speech recognizer.

Search window of 5 seconds (2.5 seconds time shift)
•  Converts the problem into a verification task (originally for with forced alignment)
•  Convenient for fusion purposes

Equally-likely 1-gram LM with target Query and Background word:
•  Query word
   o  Described by the phonetic units obtained in the previous tokenization stage
•  Background word
   o  Described by the special phonetic class background/filler
   o  Minimum duration set to 250 msec.

Detection score of detected candidates computed as the average phonetic log-likelihood ratios of the query term
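The windowing and scoring described above can be sketched as follows. The helper names `search_windows` and `detection_score` are hypothetical, and the exact averaging (per-frame ratios averaged within each phone, then across phones) is one possible reading of "average phonetic log-likelihood ratios", not a confirmed detail of the system.

```python
def search_windows(duration, win=5.0, shift=2.5):
    """Return (start, end) times in seconds of overlapping search windows."""
    windows = []
    i = 0
    while i * shift < duration:
        start = i * shift
        end = min(start + win, duration)
        windows.append((start, end))
        if end >= duration:  # the file end is covered, stop here
            break
        i += 1
    return windows


def detection_score(phone_frames):
    """Average over query phones of the mean per-frame log-likelihood ratio.

    phone_frames holds one list per query phone; each list contains
    (query_phone_loglik, background_loglik) pairs for the aligned frames.
    """
    per_phone = []
    for frames in phone_frames:
        llrs = [q - bg for q, bg in frames]
        per_phone.append(sum(llrs) / len(llrs))
    return sum(per_phone) / len(per_phone)
```

Because every sub-system scores the same fixed windows, each window becomes a yes/no verification trial, which is what makes the later fusion step straightforward.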


Spoken Query Search: BG modelling with HMM/MLP

Possible approaches
•  Re-train MLP ✖
•  Compute posterior probability of the background class depending on the other classes ✔
   o  Mean probability of the top-N most likely outputs

For the SWS system,
•  We compute the average in the likelihood domain (top-6)
   o  The decoder operates in the likelihood domain, so there is no need for an ad-hoc estimation of the BG class prior
•  We use a background scale term β (exponential in the likelihood domain) to control the weight of the BG model vs. Query
•  This β scale together with the wip term strongly affects search results
   o  Adjusted following a non-exhaustive greedy search
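A minimal sketch of the background score, assuming the per-frame input is a non-empty list of phone log-likelihoods and reading "exponential in the likelihood domain" as raising the averaged likelihood to the power β (that is, multiplying its logarithm by β). The function name `background_loglik` is hypothetical.

```python
import math


def background_loglik(frame_logliks, top_n=6, beta=1.0):
    """Log of (mean of the top-N phone likelihoods) ** beta for one frame."""
    top = sorted(frame_logliks, reverse=True)[:top_n]
    peak = top[0]
    # Average in the likelihood domain, computed stably from log values
    log_mean = peak + math.log(sum(math.exp(x - peak) for x in top) / len(top))
    return beta * log_mean
```

Under this reading, β > 1 lowers the background log-likelihood (likelihoods are at most 1 when they are scaled posteriors), making the query model win more often, while β < 1 has the opposite effect.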


Score normalization, fusion and calibration

Score normalization schemes explored:
•  Q-norm: Assume that the scores depend on the query and apply a by-query normalization
•  F-norm: Assume that the scores depend on the data file (of the collection) and apply a by-file normalization
•  Combinations (QF-norm, FQ-norm)
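The slides do not give the normalization formula, so the sketch below assumes a zero-mean, unit-variance (z-score) normalization within each group. The detection layout (dicts with `query`, `file` and `score` keys), the function names, and the order used for QF-norm (by-query first, then by-file) are all assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev


def normalize_by(detections, key):
    """Z-normalize detection scores within each group given by `key`."""
    groups = defaultdict(list)
    for det in detections:
        groups[det[key]].append(det["score"])
    stats = {k: (mean(v), pstdev(v)) for k, v in groups.items()}
    out = []
    for det in detections:
        mu, sigma = stats[det[key]]
        # Groups with no score spread are mapped to 0.0
        score = (det["score"] - mu) / sigma if sigma > 0 else 0.0
        out.append({**det, "score": score})
    return out


def q_norm(detections):
    return normalize_by(detections, "query")


def f_norm(detections):
    return normalize_by(detections, "file")


def qf_norm(detections):
    return f_norm(q_norm(detections))
```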

Fusion schemes explored:
•  Candidate detections from the 4 parallel sub-systems are kept (or rejected) according to simple combination rules:
   o  AND (all), OR (at least 1 sys) and MV (at least 2 sys)
•  Final detection score is the mean score
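The combination rules above can be sketched as a vote count per candidate. This assumes each sub-system reports its detections as a dict keyed by a shared candidate identifier (for example query, file and window start), and that the mean is taken over the sub-systems that actually fired; both points, and the name `fuse`, are assumptions.

```python
from collections import defaultdict


def fuse(system_detections, rule="MV"):
    """Fuse per-system {candidate_key: score} dicts with AND, OR or MV."""
    n_sys = len(system_detections)
    min_votes = {"AND": n_sys, "OR": 1, "MV": 2}[rule]
    scores = defaultdict(list)
    for dets in system_detections:
        for key, score in dets.items():
            scores[key].append(score)
    # Keep candidates with enough votes; fused score is the mean score
    return {
        key: sum(vals) / len(vals)
        for key, vals in scores.items()
        if len(vals) >= min_votes
    }
```

For example, a candidate detected by the pt, br and en sub-systems survives MV and OR but not AND.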

Decision threshold set according to maxATWV in dev-dev
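Picking the threshold that maximizes the term-weighted value on development data can be sketched as a sweep over the observed scores. The formula follows the usual NIST STD 2006 definition, TWV = 1 - mean over terms of (P_miss + β · P_FA), with one trial per second of speech; the default β = 999.9 is the NIST STD 2006 value and may not match the SWS 2012 evaluation settings. Function names and the detection layout are hypothetical.

```python
def twv(detections, n_true, speech_seconds, threshold, beta=999.9):
    """Term-weighted value at a given decision threshold.

    detections: list of (term, score, is_correct) tuples.
    n_true: {term: number of reference occurrences}.
    """
    terms = [t for t, n in n_true.items() if n > 0]
    total = 0.0
    for term in terms:
        kept = [ok for t, s, ok in detections if t == term and s >= threshold]
        hits = sum(1 for ok in kept if ok)
        false_alarms = len(kept) - hits
        p_miss = 1.0 - hits / n_true[term]
        p_fa = false_alarms / (speech_seconds - n_true[term])
        total += p_miss + beta * p_fa
    return 1.0 - total / len(terms)


def best_threshold(detections, n_true, speech_seconds, beta=999.9):
    """Return the observed score that maximizes TWV when used as threshold."""
    candidates = sorted({score for _, score, _ in detections})
    return max(
        candidates,
        key=lambda th: twv(detections, n_true, speech_seconds, th, beta),
    )
```

The threshold found this way on one data set is then applied unchanged to the evaluation data, where the resulting score is the actual TWV.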


Development experiments: Sub-systems performance


Development experiments: Score normalization strategies


Development experiments: Fusion strategies (after Q-norm)


Official L2F SWS2012 results

[Figure: three Combined DET Plots, one per condition (dev-eval, eval-dev, eval-eval). Axes: Miss probability (in %) vs. False Alarm probability (in %). Legend of the first plot: Random Performance; Term Wtd. p-phonetic4_fusion_mv: ALL Data, Max Val=0.633, Scr=-0.435; Term Wtd. p-phonetic4_fusion_mv: CTS Subset, Max Val=0.633, Scr=-0.435]


Conclusions

The L2F Spoken Web Search system fully exploits hybrid ANN/HMM speech recognition:
•  Query tokenization based on 1-best phonetic decoding
•  Query search based on AKWS (with no need for AM re-training)

The submitted system is formed by the fusion of four language-dependent sub-systems:
•  Q-norm score normalization is applied to each individual sub-system
•  Fusion is done following a majority voting strategy

The system achieves an actual ATWV score of 0.5195 in the eval-eval
•  Promising given the simplicity of the proposed system
•  Robust to query and collection sets
   o  Best performance is achieved in a mismatched condition!!
•  Reasonably well-calibrated

Future work: Focused on improved query tokenization and fusion with other types of approaches (DTW)

