The Spoken Web Search Task
MediaEval 2012: Spoken Web Search
Florian Metze, Marelie Davel, Etienne Barnard, Xavier Anguera, Guillaume Gravier, and Nitendra Rajput
Pisa, October 4, 2012
Outline
The Spoken Web Search Task
Data and Scoring
Organizers and Participants
Results
Discussion
Organizers
Florian Metze (Carnegie Mellon)
Etienne Barnard, Marelie Davel, Charl v. Heerden (North-West University)
Xavier Anguera (Telefonica Research)
Guillaume Gravier (IRISA)
Nitendra Rajput (IBM India)
Spoken Web Search Task: Motivation
Real-life audio content is very diverse!
“2011 Indian Data”
Any speech problem can be solved with enough: money, time, constraints, data
What if all we have is constraints?
Don’t know what language/dialect is being used? Don’t have much data!
But we don’t have to do Large-Vocabulary Speech Recognition, “only” content retrieval
What can be done?
Port outside resources (i.e., run a language-independent/-portable recognizer)
Build a “zero-knowledge” approach (i.e., try to directly identify similar words; see the sketch below)
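As a concrete illustration of the zero-knowledge idea, here is a minimal DTW sketch in Python. It assumes the query and a candidate segment have already been converted to frame-level acoustic features (e.g., MFCCs or posteriorgrams); the function name is illustrative and this is not any participant's actual system, just the core alignment step.

    import numpy as np

    def dtw_cost(query, segment):
        """Length-normalized DTW alignment cost between a spoken query
        (m x d feature matrix) and a candidate segment (n x d)."""
        # Pairwise cosine distances between all frame pairs.
        q = query / np.linalg.norm(query, axis=1, keepdims=True)
        s = segment / np.linalg.norm(segment, axis=1, keepdims=True)
        dist = 1.0 - q @ s.T
        m, n = dist.shape
        acc = np.full((m + 1, n + 1), np.inf)
        acc[0, 0] = 0.0
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                acc[i, j] = dist[i - 1, j - 1] + min(
                    acc[i - 1, j - 1],  # match both frames
                    acc[i - 1, j],      # stretch the segment
                    acc[i, j - 1],      # stretch the query
                )
        return acc[m, n] / (m + n)  # lower cost = more similar

A full system would slide the query over the whole utterance (or use subsequence DTW) and report segments whose cost falls below a threshold as detections.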
“Lwazi” Corpus
Lwazi means “knowledge”
The Lwazi project aims to develop telephony-based, speech-driven information systems
11 South African languages, 3h-6h of speech per language
Phone sets, dictionaries, read & spontaneous speech, …
3200 utterances used, from 4 languages
Primary Data Source: “African Data”
Data obtained during a targeted collection effort
Meant as a resource for speech research, so no “found” data, unlike the “Indian Data”
E. Barnard, M. Davel, and C. van Heerden, "ASR corpus design for resource-scarce languages," in Proc. INTERSPEECH, Brighton, UK, Sep. 2009, pp. 2847-2850.
Evaluation Paradigm: Spoken Term Detection (STD)
Do not attempt to convert speech to text (full recognition, ASR)
Attempt to detect the occurrence (or absence) of “keywords”
STD is not easier than doing ASR, but it requires fewer resources: in particular, no strong language model
Evaluation metrics:
(Spoken) Document Retrieval (SDR), when relaxing time constraints
Actual Term-Weighted Value (ATWV, MTWV – defined by NIST; see the definition below)
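For reference, the standard NIST definition of the Term-Weighted Value at decision threshold θ is

    \mathrm{TWV}(\theta) = 1 - \frac{1}{|T|} \sum_{t \in T}
        \Big[ P_{\mathrm{miss}}(t,\theta) + \beta \, P_{\mathrm{FA}}(t,\theta) \Big],
    \qquad
    \beta = \frac{C}{V}\left(P_{\mathrm{term}}^{-1} - 1\right)

with P_miss(t, θ) = 1 − N_correct(t, θ) / N_true(t) and P_FA(t, θ) = N_spurious(t, θ) / N_NT(t), where N_NT(t) is the number of non-target trials for term t. ATWV evaluates this at the system's submitted threshold; MTWV is the maximum over all θ. The NIST 2006 defaults are C/V = 0.1 and P_term = 10^-4, giving β ≈ 999.9; per the scoring slides, the SWS-specific parameter values differed, so these constants are only the standard reference point.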
Evaluation Idea – 4 Conditions
Test development terms on (known) development data
Test (unknown) evaluation terms on (unknown) evaluation data
Test development terms on evaluation data
Test evaluation terms on development data
Terms provided as audio examples taken from the collections
Systems could be developed with or without using external resources (i.e., other speech data); it is important to document which ones were used (“restricted” vs. “open”)
NIST Scoring Tools
Developed for the NIST 2006 Spoken Term Detection evaluation
Generates “Actual” and “Maximum Term-Weighted Value” (ATWV, MTWV)
Generates DET curves
Adapted by us
ECF = “Experiment Control File” (controls which sections to process)
RTTM = “Rich Transcription Time Mark” (defines the references)
TLIST = “Term List” files (links term IDs and the word dictionary)
A few parameters to choose
Different for 2011 and 2012, to better represent the characteristics of the SWS task (thanks, Xavi)
The best ATWV value is 1; values below 0 are possible
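For orientation, the reference files are plain formats; a schematic RTTM line and TLIST entry are shown below. The field layout and attribute names are recalled from the NIST STD 2006 documentation rather than copied, and the file/term IDs are made up, so treat the details as assumptions:

    ;; RTTM fields: type, file, channel, tbeg, tdur, orthography, subtype, speaker, confidence
    LEXEME  sws2012_dev_001  1  12.34  0.56  lwazi  lex  <NA>  <NA>

    <termlist ecf_filename="sws2012.ecf.xml" language="multiple" version="1">
      <term termid="sws2012-dev-001">
        <termtext>lwazi</termtext>
      </term>
    </termlist>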
How to Interpret DET Plots
Most useful plot
If done right, it will give you P(Miss) over P(FA) for all decision scores
A “marker” at the actual decision; if computed using the score, this will be on the line
Used for evaluation (with score.occ.txt)
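A minimal sketch of how such a DET curve can be produced from raw decision scores (synthetic data and illustrative function names; the actual evaluation uses the NIST tools):

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.stats import norm

    def det_points(scores, labels):
        """P(miss), P(FA) at every possible cutoff, from decision scores
        and binary labels (1 = true occurrence, 0 = false candidate)."""
        order = np.argsort(-scores)          # sweep the cutoff from high to low
        labels = np.asarray(labels)[order]
        hits = np.cumsum(labels)             # true detections accepted so far
        fas = np.cumsum(1 - labels)          # false alarms accepted so far
        p_miss = 1.0 - hits / labels.sum()
        p_fa = fas / (len(labels) - labels.sum())
        return p_miss, p_fa

    # Synthetic scores: targets score higher on average than non-targets.
    rng = np.random.default_rng(0)
    scores = np.concatenate([rng.normal(1, 1, 200), rng.normal(-1, 1, 2000)])
    labels = np.concatenate([np.ones(200), np.zeros(2000)])

    p_miss, p_fa = det_points(scores, labels)
    clip = lambda p: np.clip(p, 1e-4, 1 - 1e-4)  # probit axes cannot take 0 or 1
    plt.plot(norm.ppf(clip(p_fa)), norm.ppf(clip(p_miss)))
    plt.xlabel("P(FA) (probit scale)"); plt.ylabel("P(Miss) (probit scale)")
    plt.show()

The probit (normal-deviate) axes are what make Gaussian-like score distributions appear as straight lines on a DET plot.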
2012 Spoken Web Search Participants
Haipeng Wang and Tan Lee: CUHK System for the Spoken Web Search Task at MediaEval 2012
Cyril Joder, Felix Weninger, Martin Wöllmer, and Björn Schuller: The TUM Cumulative DTW Approach for the MediaEval 2012 Spoken Web Search Task
Andi Buzo, Horia Cucu, Mihai Safta, Bogdan Ionescu, and Corneliu Burileanu: ARF @ MediaEval 2012: A Romanian ASR-Based Approach to Spoken Term Detection
Alberto Abad and Ramón F. Astudillo: The L2F Spoken Web Search System for MediaEval 2012
Jozef Vavrek, Matus Pleva, and Jozef Juhar: TUKE MediaEval 2012: Spoken Web Search Using DTW and Unsupervised SVM
Amparo Varona, Mikel Penagarikano, Luis Javier Rodriguez-Fuentes, German Bordel, and Mireia Diez: GTTS System for the Spoken Web Search Task at MediaEval 2012
Igor Szoke, Michal Fapšo, and Karel Veselý: BUT 2012 Approaches for Spoken Web Search, MediaEval 2012
Aren Jansen, Benjamin Van Durme, and Pascal Clark: The JHU-HLTCOE Spoken Web Search System for MediaEval 2012
Xavier Anguera (TID): Telefonica Research System for the Spoken Web Search Task at MediaEval 2012
Summary of (Primary) Results
Team         System                                   Type        Dev ATWV   Eval ATWV
CUHK         cuhk_phnrecgmmasm_p-fusionprf_1          open        0.7824     0.7430
CUHK         cuhk_spch_p-gmmasmprf_1                  restricted  0.6776     0.6350
L2F          l2f_12_spch_p-phonetic4_fusion_mv_1      open        0.5313     0.5195
BUT          BUT_spch_p-akws-devterms_1               open        0.4884     0.4918
BUT          BUT_spch_g-DTW-devterms_1                open        0.4426     0.4477
JHU-HLTCOE   jhu_all_spch_p-rails_1                   restricted  0.3811     0.3688
TID          sws2012_IRDTW                            restricted  0.3866     0.3301
TUM          tum_spch_p-cdtw_1                        restricted  0.2628     0.2895
ARF          arf_spch_p-asrDTWAlign_w15_a08_b04       open        0.4109     0.2448
GTTS         gtts_spch_p-phone_lattice_1              open        0.0978     0.0809
TUKE         tuke_spch_p-dtwsvm                       restricted  0.0000     0.0000
DET plots for the four conditions:
Development Data, Development Terms
Development Data, Evaluation Terms
Evaluation Data, Development Terms
Evaluation Data, Evaluation Terms
Spoken Web Search Task: Summary 1
Second time around
Last year’s participants (mostly) became organizers
Grew from 5 to ca. 10 participants!
Europe, America, Asia, Africa (where are Australia and Antarctica?)
Interesting differences in performance
Thank you, all participants! It was fun & interesting.
Are the evaluation criteria useful and correct?
Spoken Web Search Task: Summary 2
Could talk a bit about JHU-HLTCOE’s “RAILS” system
Next steps?
Do more joint analysis (we hope everybody’s results agree with ours)
Shared Publications? ICASSP? Journal?
Develop task further for next year?
“Speech Kitchen” idea will be presented later …
Thank You!
How to Interpret the *.occ.txt File
Coefficients C, V: weighting of correct vs. incorrect detections
Probability of a Term: expected frequency of terms
Average and Maximum TWV
P(FA) and P(Miss)
Optimal decision score
Values used for padding and multi-term detections are missing
In some rare cases, the file lists different values for the total and for a sub-class only
Was expecting more questions
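A back-of-the-envelope sketch of how the C, V, and Probability-of-Term settings above enter the metric. The constants shown are the NIST 2006 defaults; the actual SWS 2012 values differed, as noted above, so treat them as placeholders:

    import numpy as np

    C_OVER_V = 0.1   # cost of a false alarm relative to the value of a hit
    P_TERM = 1e-4    # prior probability of a term occurring
    BETA = C_OVER_V * (1.0 / P_TERM - 1.0)   # ~999.9

    def atwv(n_correct, n_true, n_spurious, n_nontarget):
        """Actual TWV from per-term counts (one array entry per term),
        at the system's submitted decision threshold."""
        p_miss = 1.0 - n_correct / n_true
        p_fa = n_spurious / n_nontarget
        return 1.0 - np.mean(p_miss + BETA * p_fa)

For example, atwv(np.array([8., 3.]), np.array([10., 5.]), np.array([1., 0.]), np.array([36000., 36000.])) scores two terms at once; the non-target trial counts are taken as given here, whereas NIST derives them from the amount of speech.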
Parameters Used
The tools assume you use a “decision score”: submit “candidates” with a score lower than the cutoff and “detections” with a score higher than the cutoff
This enables plotting of DET curves, but can be confusing
Different parameters were used for the African and Indian data sets, to reflect their different use cases
KoefV/KoefC are debatable: what is the cost of a wrong detection, and what is the benefit of a correct one?
-P Probability-of-Term: how frequently are terms expected to occur?
How to Interpret score.det.thresh.pdf
Can be used to analyze decision-score behavior: P(FA) (false alarms), P(Miss) (missed detections), and the resulting TWV, as functions of the threshold (see the sketch below)
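A sketch of the underlying threshold sweep, pooled over terms for brevity (the real tool averages per term), assuming scores and binary labels for all putative detections plus the trial counts are given:

    import numpy as np

    def twv_vs_threshold(scores, labels, n_true, n_nontarget, beta=999.9):
        """TWV as a function of the decision cutoff. MTWV is the maximum of
        the returned curve; ATWV is its value at the submitted cutoff."""
        thresholds = np.sort(np.unique(scores))
        twv = []
        for th in thresholds:
            keep = scores >= th               # everything at/above th is a detection
            hits = np.sum(labels[keep] == 1)
            fas = np.sum(labels[keep] == 0)
            twv.append(1.0 - ((1.0 - hits / n_true) + beta * fas / n_nontarget))
        return thresholds, np.array(twv)

The gap between MTWV and ATWV, plotted per system in the charts below, shows how well each team calibrated its decision threshold.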
[Bar chart: Dev-Dev MTWV-ATWV differences per system (y-axis 0.01 to 0.1). Systems: ARF2012/arf_spch_g-asrDTWAlign_w15_a01_b04, ARF2012/arf_spch_g-asrSausageAlign_w10_a02_b00, ARF2012/arfl_spch_g-asrLatticeGrammar, BUT2012/BUT_spch_g-DTW-devterms_1, CUHK2012/cuhk_phnrecgmmasm_g-fusion-1, CUHK2012/cuhk_spch_g-gmmasm_1, GTTS2012/gtts_spch_p-phone_lattice_1, JHU2012/jhu_all_spch_p-rails_1, L2F2012/l2f_12_spch_p-phonetic4_fusion_mv_1, TID2012/sws2012_IRDTW, TUM2012/tum_phonrec_g-cdtwPhonProb_20]
[Bar chart: Eval-Eval MTWV-ATWV differences for the same systems (y-axis 0.01 to 0.1)]
[Bar chart: Dev-Eval MTWV-ATWV differences for the same systems (y-axis 0.05 to 0.25)]
[Bar chart: Eval-Dev MTWV-ATWV differences for the same systems (y-axis 0.05 to 0.25)]