The Spoken Web Search Task
MediaEval 2012: Spoken Web Search
Florian Metze, Marelie Davel, Etienne Barnard, Xavier Anguera, Guillaume Gravier, and Nitendra Rajput
Pisa, October 4, 2012
Outline
The Spoken Web Search Task
Data and Scoring
Organizers and Participants
Results
Discussion
Organizers
Florian Metze (Carnegie Mellon)
Etienne Barnard, Marelie Davel, Charl v. Heerden (North-West University)
Xavier Anguera (Telefonica Research)
Guillaume Gravier (IRISA)
Nitendra Rajput (IBM India)
Spoken Web Search Task: Motivation
Real-life audio content is very diverse!
“2011 Indian Data”
Any speech problem can be solved with enough: money, time, constraints, data
What if all we have is constraints?
Don’t know what language/dialect is being used? Don’t have much data!
But we don’t have to do Large-Vocabulary Speech Recognition, “only” content retrieval
What can be done?
Port outside resources (i.e., run a language-independent/-portable recognizer)
Build a “zero-knowledge” approach (i.e., try to directly identify similar words; see the sketch below)
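As a concrete illustration of the zero-knowledge idea, here is a minimal DTW sketch in Python. It assumes the query and a candidate segment have already been converted to frame-level acoustic features (e.g., MFCCs or posteriorgrams); the function name is illustrative and this is not any participant's actual system, just the core alignment step.

    import numpy as np

    def dtw_cost(query, segment):
        """Length-normalized DTW alignment cost between a spoken query
        (m x d feature matrix) and a candidate segment (n x d)."""
        # Pairwise cosine distances between all frame pairs.
        q = query / np.linalg.norm(query, axis=1, keepdims=True)
        s = segment / np.linalg.norm(segment, axis=1, keepdims=True)
        dist = 1.0 - q @ s.T
        m, n = dist.shape
        acc = np.full((m + 1, n + 1), np.inf)
        acc[0, 0] = 0.0
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                acc[i, j] = dist[i - 1, j - 1] + min(
                    acc[i - 1, j - 1],  # match both frames
                    acc[i - 1, j],      # stretch the segment
                    acc[i, j - 1],      # stretch the query
                )
        return acc[m, n] / (m + n)  # lower cost = more similar

A full system would slide the query over the whole utterance (or use subsequence DTW) and report segments whose cost falls below a threshold as detections.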
“Lwazi” Corpus
Lwazi means “knowledge”
The Lwazi project aims to develop telephony-based, speech-driven information systems
11 South African languages, 3h-6h of speech per language
Phone sets, dictionaries, read & spontaneous speech, …
3200 utterances used, from 4 languages
Primary Data Source: “African Data”
Data obtained during a targeted collection effort
Meant as a resource for speech research, so no “found” data, unlike the “Indian Data”
E. Barnard, M. Davel, and C. van Heerden, "ASR corpus design for resource-scarce languages," in Proc. INTERSPEECH, Brighton, UK, Sep. 2009, pp. 2847-2850.
Evaluation Paradigm: Spoken Term Detection (STD)
Do not attempt to convert speech to text (full recognition, ASR)
Attempt to detect the occurrence (or absence) of “keywords”
STD is not easier than doing ASR, but it requires fewer resources: in particular, no strong language model
Evaluation metrics:
(Spoken) Document Retrieval (SDR), when relaxing time constraints
Actual Term-Weighted Value (ATWV, MTWV – defined by NIST; see the definition below)
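For reference, the standard NIST definition of the Term-Weighted Value at decision threshold θ is

    \mathrm{TWV}(\theta) = 1 - \frac{1}{|T|} \sum_{t \in T}
        \Big[ P_{\mathrm{miss}}(t,\theta) + \beta \, P_{\mathrm{FA}}(t,\theta) \Big],
    \qquad
    \beta = \frac{C}{V}\left(P_{\mathrm{term}}^{-1} - 1\right)

with P_miss(t, θ) = 1 − N_correct(t, θ) / N_true(t) and P_FA(t, θ) = N_spurious(t, θ) / N_NT(t), where N_NT(t) is the number of non-target trials for term t. ATWV evaluates this at the system's submitted threshold; MTWV is the maximum over all θ. The NIST 2006 defaults are C/V = 0.1 and P_term = 10^-4, giving β ≈ 999.9; per the scoring slides, the SWS-specific parameter values differed, so these constants are only the standard reference point.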
Evaluation Idea – 4 Conditions
Test development terms on (known) development data
Test (unknown) evaluation terms on (unknown) evaluation data
Test development terms on evaluation data
Test evaluation terms on development data
Terms provided as audio examples taken from the collections
Systems could be developed with or without using external resources (i.e., other speech data); it is important to document which ones were used (“restricted” vs. “open”)
NIST Scoring Tools
Developed for the NIST 2006 Spoken Term Detection evaluation
Generates “Actual” and “Maximum Term-Weighted Value” (ATWV, MTWV)
Generates DET curves
Adapted by us
ECF = “Experiment Control File” (controls which sections to process)
RTTM = “Rich Transcription Time Mark” (defines the references)
TLIST = “Term List” files (links term IDs and the word dictionary)
A few parameters to choose
Different for 2011 and 2012, to better represent the characteristics of the SWS task (thanks, Xavi)
The best ATWV value is 1; values below 0 are possible
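For orientation, the reference files are plain formats; a schematic RTTM line and TLIST entry are shown below. The field layout and attribute names are recalled from the NIST STD 2006 documentation rather than copied, and the file/term IDs are made up, so treat the details as assumptions:

    ;; RTTM fields: type, file, channel, tbeg, tdur, orthography, subtype, speaker, confidence
    LEXEME  sws2012_dev_001  1  12.34  0.56  lwazi  lex  <NA>  <NA>

    <termlist ecf_filename="sws2012.ecf.xml" language="multiple" version="1">
      <term termid="sws2012-dev-001">
        <termtext>lwazi</termtext>
      </term>
    </termlist>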
How to Interpret DET Plots
Most useful plot
If done right, it will give you P(Miss) over P(FA) for all decision scores
A “marker” at the actual decision; if computed using the score, this will be on the line
Used for evaluation (with score.occ.txt)
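A minimal sketch of how such a DET curve can be produced from raw decision scores (synthetic data and illustrative function names; the actual evaluation uses the NIST tools):

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.stats import norm

    def det_points(scores, labels):
        """P(miss), P(FA) at every possible cutoff, from decision scores
        and binary labels (1 = true occurrence, 0 = false candidate)."""
        order = np.argsort(-scores)          # sweep the cutoff from high to low
        labels = np.asarray(labels)[order]
        hits = np.cumsum(labels)             # true detections accepted so far
        fas = np.cumsum(1 - labels)          # false alarms accepted so far
        p_miss = 1.0 - hits / labels.sum()
        p_fa = fas / (len(labels) - labels.sum())
        return p_miss, p_fa

    # Synthetic scores: targets score higher on average than non-targets.
    rng = np.random.default_rng(0)
    scores = np.concatenate([rng.normal(1, 1, 200), rng.normal(-1, 1, 2000)])
    labels = np.concatenate([np.ones(200), np.zeros(2000)])

    p_miss, p_fa = det_points(scores, labels)
    clip = lambda p: np.clip(p, 1e-4, 1 - 1e-4)  # probit axes cannot take 0 or 1
    plt.plot(norm.ppf(clip(p_fa)), norm.ppf(clip(p_miss)))
    plt.xlabel("P(FA) (probit scale)"); plt.ylabel("P(Miss) (probit scale)")
    plt.show()

The probit (normal-deviate) axes are what make Gaussian-like score distributions appear as straight lines on a DET plot.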
2012 Spoken Web Search Participants
Haipeng Wang and Tan Lee: CUHK System for the Spoken Web Search Task at MediaEval 2012
Cyril Joder, Felix Weninger, Martin Wöllmer, and Björn Schuller: The TUM Cumulative DTW Approach for the MediaEval 2012 Spoken Web Search Task
Andi Buzo, Horia Cucu, Mihai Safta, Bogdan Ionescu, and Corneliu Burileanu: ARF @ MediaEval 2012: A Romanian ASR-Based Approach to Spoken Term Detection
Alberto Abad and Ramón F. Astudillo: The L2F Spoken Web Search System for MediaEval 2012
Jozef Vavrek, Matus Pleva, and Jozef Juhar: TUKE MediaEval 2012: Spoken Web Search Using DTW and Unsupervised SVM
Amparo Varona, Mikel Penagarikano, Luis Javier Rodriguez-Fuentes, German Bordel, and Mireia Diez: GTTS System for the Spoken Web Search Task at MediaEval 2012
Igor Szoke, Michal Fapšo, and Karel Veselý: BUT 2012 Approaches for Spoken Web Search, MediaEval 2012
Aren Jansen, Benjamin Van Durme, and Pascal Clark: The JHU-HLTCOE Spoken Web Search System for MediaEval 2012
Xavier Anguera (TID): Telefonica Research System for the Spoken Web Search Task at MediaEval 2012
Summary of (Primary) Results
Team         System                                   Type        Dev ATWV   Eval ATWV
CUHK         cuhk_phnrecgmmasm_p-fusionprf_1          open        0.7824     0.7430
CUHK         cuhk_spch_p-gmmasmprf_1                  restricted  0.6776     0.6350
L2F          l2f_12_spch_p-phonetic4_fusion_mv_1      open        0.5313     0.5195
BUT          BUT_spch_p-akws-devterms_1               open        0.4884     0.4918
BUT          BUT_spch_g-DTW-devterms_1                open        0.4426     0.4477
JHU-HLTCOE   jhu_all_spch_p-rails_1                   restricted  0.3811     0.3688
TID          sws2012_IRDTW                            restricted  0.3866     0.3301
TUM          tum_spch_p-cdtw_1                        restricted  0.2628     0.2895
ARF          arf_spch_p-asrDTWAlign_w15_a08_b04       open        0.4109     0.2448
GTTS         gtts_spch_p-phone_lattice_1              open        0.0978     0.0809
TUKE         tuke_spch_p-dtwsvm                       restricted  0.0000     0.0000
DET plots for the four conditions:
Development Data, Development Terms
Development Data, Evaluation Terms
Evaluation Data, Development Terms
Evaluation Data, Evaluation Terms
Spoken Web Search Task: Summary 1
Second time around
Last year’s participants (mostly) became organizers
Grew from 5 to ca. 10 participants!
Europe, America, Asia, Africa (where are Australia and Antarctica?)
Interesting differences in performance
Thank you, all participants! It was fun & interesting.
Are the evaluation criteria useful and correct?
Spoken Web Search Task: Summary 2
Could talk a bit about JHU-HLTCOE’s “RAILS” system
Next steps?
Do more joint analysis (we hope everybody’s results agree with ours)
Shared Publications? ICASSP? Journal?
Develop task further for next year?
“Speech Kitchen” idea will be presented later …
Thank You!
How to Interpret the *.occ.txt File
Coefficients C, V: weighting of correct vs. incorrect detections
Probability of a Term: expected frequency of terms
Average and Maximum TWV
P(FA) and P(Miss)
Optimal decision score
Values used for padding and multi-term detections are missing
In some rare cases, the file lists different values for the total and for a sub-class only
Was expecting more questions
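A back-of-the-envelope sketch of how the C, V, and Probability-of-Term settings above enter the metric. The constants shown are the NIST 2006 defaults; the actual SWS 2012 values differed, as noted above, so treat them as placeholders:

    import numpy as np

    C_OVER_V = 0.1   # cost of a false alarm relative to the value of a hit
    P_TERM = 1e-4    # prior probability of a term occurring
    BETA = C_OVER_V * (1.0 / P_TERM - 1.0)   # ~999.9

    def atwv(n_correct, n_true, n_spurious, n_nontarget):
        """Actual TWV from per-term counts (one array entry per term),
        at the system's submitted decision threshold."""
        p_miss = 1.0 - n_correct / n_true
        p_fa = n_spurious / n_nontarget
        return 1.0 - np.mean(p_miss + BETA * p_fa)

For example, atwv(np.array([8., 3.]), np.array([10., 5.]), np.array([1., 0.]), np.array([36000., 36000.])) scores two terms at once; the non-target trial counts are taken as given here, whereas NIST derives them from the amount of speech.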
Parameters Used
The tools assume you use a “decision score”: submit “candidates” with a score lower than the cutoff and “detections” with a score higher than the cutoff
This enables plotting of DET curves, but can be confusing
Different parameters were used for the African and Indian data sets, to reflect their different use cases
KoefV/KoefC are debatable: what is the cost of a wrong detection, and what is the benefit of a correct one?
-P Probability-of-Term: how frequently are terms expected to occur?
How to Interpret score.det.thresh.pdf
Can be used to analyze decision-score behavior: P(FA) (false alarms), P(Miss) (missed detections), and the resulting TWV, as functions of the threshold (see the sketch below)
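A sketch of the underlying threshold sweep, pooled over terms for brevity (the real tool averages per term), assuming scores and binary labels for all putative detections plus the trial counts are given:

    import numpy as np

    def twv_vs_threshold(scores, labels, n_true, n_nontarget, beta=999.9):
        """TWV as a function of the decision cutoff. MTWV is the maximum of
        the returned curve; ATWV is its value at the submitted cutoff."""
        thresholds = np.sort(np.unique(scores))
        twv = []
        for th in thresholds:
            keep = scores >= th               # everything at/above th is a detection
            hits = np.sum(labels[keep] == 1)
            fas = np.sum(labels[keep] == 0)
            twv.append(1.0 - ((1.0 - hits / n_true) + beta * fas / n_nontarget))
        return thresholds, np.array(twv)

The gap between MTWV and ATWV, plotted per system in the charts below, shows how well each team calibrated its decision threshold.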
[Bar chart: Dev-Dev MTWV-ATWV differences per system (y-axis 0.01 to 0.1). Systems: ARF2012/arf_spch_g-asrDTWAlign_w15_a01_b04, ARF2012/arf_spch_g-asrSausageAlign_w10_a02_b00, ARF2012/arfl_spch_g-asrLatticeGrammar, BUT2012/BUT_spch_g-DTW-devterms_1, CUHK2012/cuhk_phnrecgmmasm_g-fusion-1, CUHK2012/cuhk_spch_g-gmmasm_1, GTTS2012/gtts_spch_p-phone_lattice_1, JHU2012/jhu_all_spch_p-rails_1, L2F2012/l2f_12_spch_p-phonetic4_fusion_mv_1, TID2012/sws2012_IRDTW, TUM2012/tum_phonrec_g-cdtwPhonProb_20]
[Bar chart: Eval-Eval MTWV-ATWV differences for the same systems (y-axis 0.01 to 0.1)]
[Bar chart: Dev-Eval MTWV-ATWV differences for the same systems (y-axis 0.05 to 0.25)]
[Bar chart: Eval-Dev MTWV-ATWV differences for the same systems (y-axis 0.05 to 0.25)]