The L2F Spoken Web Search system for Mediaeval 2012
-
Upload
mediaeval2012 -
Category
Technology
-
view
673 -
download
0
description
Transcript of The L2F Spoken Web Search system for Mediaeval 2012
1 The L2F SWS system for Mediaeval 2012 Pisa, October 4, 2012
Alberto Abad and Ramón F. Astudillo L2F -‐ Spoken Language Systems Lab
INESC-‐ID Lisboa, Portugal [email protected]‐id.pt
Mediaeval 2012 Workshop Pisa, October 4, 2012
The L2F Spoken Web Search system for Mediaeval 2012
2 The L2F SWS system for Mediaeval 2012 Pisa, October 4, 2012
IntroducJon
In this first L2F/INESC-‐ID parJcipaJon on SWS our main objecJves were: • To learn
• To have fun • To build a reasonable system (in an unreasonable reduced amount of Jme)
The submiVed L2F SWS system exploits hybrid ANN/HMM connecJonist methods for both query tokenizaJon and acousJc keyword search • Composed by the fusion of 4 phoneJc-‐based SWS sub-‐systems
o Based on our in-‐house ASR system named AUDIMUS
• Each sub-‐system uses different language-‐dependent acousJc models: o European Portuguese (pt), Brazilian Portuguese (br), European Spanish (es) and
American English (en)
• Different detecJon score normalizaJon and fusion methods invesJgated o SubmiVed system applies per-‐query score normalizaJon (Q-‐norm) and majority voJng
(MV) fusion
3 The L2F SWS system for Mediaeval 2012 Pisa, October 4, 2012
The baseline speech recognizer
AUDIMUS is our in-‐house hybrid HMM/MLP speech recognizer
• Feature extrac@on MulJ-‐stream 26 PLP, 26 logRASTA-‐PLP, 28 MSG and 39 ETSI (only 8kHz) • MLP Several context input frames (13-‐15), 2 hidden-‐layers (500 units) and 1 output layer.
• Output layer size 39 for pt, 40 for br, 30 for es, 41 for en. • Data pt 115 hours (57 BN+58 telephone); br 13 hours of BN data; es 57hours (36 BN
+21 telephone); en 142 hours of BN data (HUB-‐4 96 & 97) • HMM topology Single-‐state phonemes (3 frames minimum duraJon) • Decoder Uses Weighted Finite-‐State Transducer (WFST) approach
4 The L2F SWS system for Mediaeval 2012 Pisa, October 4, 2012
Spoken Query TokenizaJon
For each sub-‐system, obtain a phoneJc tokenizaJon of the queries • Similar to LID parallel phonotacJc
approaches
Use a phone-‐loop grammar with phoneme minimum duraJon of 3 frames to obtain 1-‐best phoneme chain • AlternaJve n-‐best hypothesis for
charactering each query were explored with unsaJsfactory results o Other possibiliJes not explored (yet): lakce, CN, etc…
• Influence of the word inserJon penalty (wip) parameter on the tokenizaJon result
5 The L2F SWS system for Mediaeval 2012 Pisa, October 4, 2012
Spoken Query Search
Spoken query search based on AKWS with our hybrid speech recognizer. Search window of 5 seconds (2.5 seconds Jme shio)
• Convert the problem in a verificaJon task (originally for with forced alignment)
• Convenient for fusion purposes Equally-‐likely 1-‐gram LM with target Query and Background word:
• Query word o Described by the phoneJc units obtained in the previous tokenizaJon stage
• Background word o Described by the special phoneJc class background/filler o Minimum duraJon set to 250 msec.
DetecJon score of detected candidates computed as the average phoneJc log-‐likelihood raJos of the query term
6 The L2F SWS system for Mediaeval 2012 Pisa, October 4, 2012
Spoken Query Search BG modelling with HMM/MLP
Possible approaches • Re-‐train MLP ✖ • Compute posterior probability of the background class depending on the other
classes ✔ o Mean probability of the top-‐N most likely outputs
For the SWS system, • We compute the average in the likelihood domain (top-‐6)
o The decoder operates in the likelihood domain, so there is not need for and add-‐hoc esJmaJon of the BG class prior
• We use a background scale term β (exponenJal in the likelihood domain) to control the weight of the BG model vs. Query
• This β scale together with the wip term strongly affects searching results o Adjusted following a non-‐exhausJve greedy search
7 The L2F SWS system for Mediaeval 2012 Pisa, October 4, 2012
Score normalizaJon, fusion and calibraJon
Score normaliza@on schemes explored: • Q-‐norm Assume that the scores are dependent of the queries and
apply a by-‐query normalizaJon • F-‐norm Assume that the scores are dependent of the data file (of
the collecJon) and apply a by-‐file normalizaJon • CombinaJons (QF-‐norm, FQ-‐norm)
Fusion schemes explored: • Candidate detecJons from the 4 parallel sub-‐systems are kept (or
rejected) according to simple combinaJon rules: o AND (all), OR (at least 1 sys) and MV (at least 2 sys)
• Final detecJon score is the mean score
Decision threshold set according to maxATWV in dev-‐dev
8 The L2F SWS system for Mediaeval 2012 Pisa, October 4, 2012
Development experiments Sub-‐systems performance
9 The L2F SWS system for Mediaeval 2012 Pisa, October 4, 2012
Development experiments Score normalizaJon strategies
10 The L2F SWS system for Mediaeval 2012 Pisa, October 4, 2012
Development experiments Fusion strategies (aoer Q-‐norm)
11 The L2F SWS system for Mediaeval 2012 Pisa, October 4, 2012
Official L2F SWS2012 results
5
10
20
40
60
80
90
95
98
.0001 .001 .004 .01.02 .05 .1 .2 .5 1 2 5 10 20 40
Mis
s pr
obab
ility
(in %
)
False Alarm probability (in %)
Combined DET Plot
Random PerformanceTerm Wtd. p-phonetic4_fusion_mv : ALL Data Max Val=0.633 Scr=-0.435
Term Wtd. p-phonetic4_fusion_mv: CTS Subset Max Val=0.633 Scr=-0.435
5
10
20
40
60
80
90
95
98
.0001 .001 .004 .01.02 .05 .1 .2 .5 1 2 5 10 20 40
Mis
s pr
obab
ility
(in %
)
False Alarm probability (in %)
Combined DET Plot
Random PerformanceTerm Wtd. p-phonetic4_fusion_mv : ALL Data Max Val=0.486 Scr=-0.135
Term Wtd. p-phonetic4_fusion_mv: CTS Subset Max Val=0.486 Scr=-0.135
5
10
20
40
60
80
90
95
98
.0001 .001 .004 .01.02 .05 .1 .2 .5 1 2 5 10 20 40
Mis
s pr
obab
ility
(in %
)
False Alarm probability (in %)
Combined DET Plot
Random PerformanceTerm Wtd. p-phonetic4_fusion_mv : ALL Data Max Val=0.523 Scr=-0.362
Term Wtd. p-phonetic4_fusion_mv: CTS Subset Max Val=0.523 Scr=-0.362
dev-‐eval eval-‐dev eval-‐eval
12 The L2F SWS system for Mediaeval 2012 Pisa, October 4, 2012
Conclusions
The L2F Spoken Web Search system fully exploits hybrid ANN/HMM speech recogniJon: • Query tokenizaJon based on 1-‐best phoneJc decoding • Query search based on AKWS (with no need for AM re-‐training)
The submiVed system is formed by the fusion of four language-‐dependent sub-‐systems: • Q-‐norm score normalizaJon is applied to each individual sub-‐system • Fusion is done following a majority voJng strategy
The system achieves an actual ATWV score of 0.5195 in the eval-‐eval • Promising given the simplicity of the proposed system • Robust to query and collecJon sets
o Best performance is achieved in a mismatched condiJon!! • Reasonably well-‐calibrated
Future work Focused in improved Query tokenizaJon, fusion with other type of approaches (DTW)
Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento em Lisboa
technology from seed
L2 F - Spoken Language Systems Laboratory 13
technology from seed
L2 F - Spoken Language Systems Laboratory