European Language Resources Association (ELRA)
HLT Evaluations
Khalid CHOUKRI
ELRA/ELDA, 55 Rue Brillat-Savarin, F-75013 Paris, France
Tel. +33 1 43 13 33 33 -- Fax. +33 1 43 13 33 30
Email: choukri@elda.org
http://www.elda.org/ or http://www.elra.info/
Presentation Outline
• European Language Resources Association
• Evaluation to drive research progress
• Human Language Technologies Evaluation(s) » What, why, for whom, how …
• (Some figures from TC-STAR)
• Examples of evaluation campaigns
• Demo … (available afterwards)
European Language Resources Association: an improved infrastructure for data sharing & HLT evaluation
(1) An association of users of Language Resources
(2) An infrastructure for the evaluation of Human Language Technologies, providing resources, tools, methodologies, and logistics
The Association
• Membership drive: ELRA is open to European & non-European institutions. Resources are available to members & non-members (pay per resource).
• Some of the benefits of becoming a member:
  – Substantial discounts on LR prices (over 70%)
  – Substantial discounts on LREC registration fees
  – Legal and contractual assistance with respect to LR matters
  – Access to validation and production manuals (quality assessment)
  – Figures and facts about the market (results of ELRA surveys)
  – Newsletter and other publications
  – New: Fidelity program … earn miles and get more benefits
ELRA: an efficient infrastructure to serve the HLT community
Strategies for the next decade … new ELRA status:
2005: extension of ELRA's official mission to promote LRs and evaluation for Human Language Technology (HLT):
"The mission of the Association is to promote language resources (henceforth LRs) and evaluation for the Human Language Technology (HLT) sector in all their forms and all their uses."
[Figure: courtesy G. Thurmair, Malta workshop.]
What to evaluate … Levels of Evaluation
[Diagram: the pipeline from Basic Research (long term / high risk) through Technology Development to Application Development (large return on investment, evolutionary). Quantitative Evaluation and Usage Evaluation are the meeting points with technology development: research results feed quantitative evaluation and bottleneck identification; technologies needed for applications are validated for applications, where usability and acceptability are judged.]
What to evaluate … Levels of Evaluation
- Basic Research Evaluation (validate research direction)
- Technology Evaluation (assessment of a solution to a well-defined problem)
- Usage Evaluation (end-users in the field)
- Impact Evaluation (socio-economic consequences)
- Programme Evaluation (funding agencies)
Our concern: Technology Evaluation.
Why Evaluate?
• Validate research hypotheses
• Assess progress
• Choose between research alternatives
• Identify promising technologies (market)
• Benchmarking … state of the art
• Share knowledge … dedicated workshops
• Feedback … Funding agencies
• Share Costs ???
Progress & Evaluation (courtesy Charles Wayne)
Technology performance & Applications
• Bad technology may be used to design useful applications
• What about good technology? …
» Software industry
HLT Evaluations … for whom?
• MT developers want to improve the "quality" of MT output
• MT users (humans or software, e.g. CLIR) want to improve productivity by using the most suitable MT system (e.g. multilinguality)
• …
For whom … essential for technology development
Sharing of information and knowledge between participants: how to get the best results, access to data, scoring tools.
Information obtained by industrialists: state of the art, technology choices, market strategy, new products.
Information obtained by funding agencies: technology performance, progress vs. investment, priorities.
Some types of evaluations
• Comparative evaluation: the same or similar control tasks and related data, with metrics that are agreed upon
• Competitive vs. cooperative
• Black-box evaluation … glass-box
• Objective evaluation … subjective (human-based)
• Corpus-based (test suites)
• Quantitative measures … qualitative
Comparative Evaluation of Technology
Used successfully in the USA by DARPA and NIST (since 1984)
Similar efforts in Europe on a smaller scale, mainly projects (EU-funded or national programs)
• Select a common "task"
• Attract enough participants
• Organize the campaign (protocol/metrics/data)
• Follow-up workshop: interpret results and share information
Requirements for an evaluation campaign
• Reference Language Resources (data): the "truth"
• Metric(s): automatic, human judgments … scoring software
• Scale/range of performance to compare with (baseline)
• Logistics management
• Reliability assessment: an independent body
• Participants: technology providers
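Putting these ingredients together, the core of a campaign's scoring logistics is a small loop over participants' submissions. A minimal sketch in Python, with a toy metric and entirely hypothetical names (not any campaign's actual tooling):

```python
# Hypothetical sketch of a campaign scoring loop: reference data (the
# "truth"), an agreed metric, and a baseline to compare against.

def word_accuracy(hypothesis: str, reference: str) -> float:
    """Toy metric: fraction of reference words matched position by position."""
    hyp, ref = hypothesis.split(), reference.split()
    return sum(h == r for h, r in zip(hyp, ref)) / len(ref) if ref else 0.0

def score_campaign(submissions, references, baseline):
    """Score every participant on the shared test set, against the baseline."""
    for participant, outputs in submissions.items():
        scores = [word_accuracy(h, r) for h, r in zip(outputs, references)]
        mean = sum(scores) / len(scores)
        verdict = "above" if mean > baseline else "at or below"
        print(f"{participant}: {mean:.3f} ({verdict} baseline {baseline:.3f})")

score_campaign(
    submissions={"site_A": ["the cat sat"], "site_B": ["a cat sat"]},
    references=["the cat sat"],
    baseline=0.5,
)
```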
HLT Evaluation Portal … pointers to projects
• Overview
• HLT Evaluations
• Activities by technology
• Activities by geographical region
• Players
• Evaluation resources
• Evaluation services
http://www.hlt-evaluation.org/
Let us list some well-known campaigns.
Examples of Evaluation Campaigns – Capitalization
Speech & Audio/sound
• ASR: TC-STAR, CHIL, ESTER
• TTS: TC-STAR, EVASY
• Speaker identification (CHIL)
• Speech-to-speech translation
• Speech understanding (MEDIA)
• Acoustic person tracking
• Speech activity detection, …
Examples of Evaluation Campaigns – Capitalization
Multimodal – Video – Vision technologies
• Face detection
• Visual person tracking
• Visual speaker identification
• Head pose estimation
• Hand tracking
Some of the technologies being evaluated within CHIL … http://chil.server.de/
A) Vision technologies
– A.1) Face Detection
– A.2) Visual Person Tracking
– A.3) Visual Speaker Identification
– A.4) Head Pose Estimation
– A.5) Hand Tracking
B) Sound and Speech technologies
– B.1) Close-Talking Automatic Speech Recognition
– B.2) Far-Field Automatic Speech Recognition
– B.3) Acoustic Person Tracking
– B.4) Acoustic Speaker Identification
– B.5) Speech Activity Detection
– B.6) Acoustic Scene Analysis
C) Content Processing technologies
– C.1) Automatic Summarisation … Question Answering
more at the CHIL/CLEAR workshops
Examples of Evaluation Campaigns – Capitalization
Written NLP & Content
• IR, CLIR, QA (Amaryllis, EQUER, CLEF)
• Text analysers (Grace, EASY)
• MT (CESTA, TC-STAR)
• Corpus alignment & processing (Arcade, Arcade-2, Romanseval/Senseval, …)
• Term & terminology extraction
• Summarisation
Evaluation Projects … the French scene (similar projects in NL, Italy, …)
Technolangue/Evalda: the Evalda platform consists of 8 evaluation campaigns focused on spoken and written language technologies for the French language:
– ARCADE II: evaluation of bilingual corpora alignment systems
– CESART: evaluation of terminology extraction systems
– CESTA: evaluation of machine translation systems (Ar, En => Fr)
– EASY: evaluation of parsers
– ESTER: evaluation of automatic broadcast news transcription systems
– EQUER: evaluation of question answering systems
– EVASY: evaluation of speech synthesis systems
– MEDIA: evaluation of in-context and out-of-context dialogue systems
Some details from relevant projects
• CLEF
• TC-STAR
Example of Evaluation Initiatives
CLEF (Cross-Language Evaluation Forum)
Promoting research and development in Cross-Language Information Retrieval (CLIR) by:
(i) providing an infrastructure for the testing and evaluation of information retrieval systems operating on European languages, in monolingual and cross-language contexts;
(ii) creating test packages of reusable data which can be employed by system developers for benchmarking purposes.
QA-CLEF: State of the art & Improvement
[Bar chart: best-system and combination scores (0–90%) in the 2004 and 2005 QA-CLEF campaigns, for Bulgarian, German, English, Spanish, Finnish, French, Italian, Dutch and Portuguese.]
Back to evaluation tasks within TC-STAR (http://www.tc-star.org/)
Two categories of transcribing and translating tasks:
– European Parliament Plenary Sessions (EPPS): English (En) and Spanish (Es)
– Broadcast News (Voice of America, VoA): Mandarin Chinese (Zh) and English (En)
TC-STAR: speech-to-speech translation
• Packages with speech recognition, speech translation, and speech synthesis
• Development and test data, metrics & results
TC-STAR evaluations … 3 consecutive annual evaluations
1. SLT in the following directions:
   i. Chinese-to-English (Broadcast News)
   ii. Spanish-to-English (European Parliament plenary speeches)
   iii. English-to-Spanish (European Parliament plenary speeches)
2. ASR in the following languages:
   i. English (European Parliament plenary speeches)
   ii. Spanish (European Parliament plenary speeches)
   iii. Mandarin Chinese (Broadcast News)
3. TTS in Chinese, English, and Spanish under the following conditions:
   i. Complete system
   ii. Voice conversion, intralingual and crosslingual; expressive speech
   iii. Component evaluation
Improvement of SLT performance (En→Es)
[Bar chart: BLEU [%] (30–55) for the 2006 and 2007 systems from IBM, IRST, RWTH and UPC, under three input conditions: text (FTE), verbatim transcripts, and speech recognition output (ASR).]
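BLEU, the automatic metric in the chart above, combines modified n-gram precisions with a brevity penalty. A self-contained sketch of the single-reference, uniformly weighted case (the campaigns used standard scoring tools; this is only illustrative):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypothesis: str, reference: str, max_n: int = 4) -> float:
    """Sentence-level BLEU with one reference and uniform n-gram weights."""
    hyp, ref = hypothesis.split(), reference.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        # Clipped n-gram precision: each hypothesis n-gram counts at most
        # as often as it appears in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = max(sum(hyp_counts.values()), 1)
        log_prec += math.log(max(clipped, 1e-9) / total)  # smooth zero counts
    # Brevity penalty: punish hypotheses shorter than the reference.
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / max(len(hyp), 1))
    return bp * math.exp(log_prec / max_n)

print(f"{bleu('the cat sat on the mat', 'the cat is on the mat'):.3f}")
```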
Improvement of SLT performance (Es→En)
[Bar chart: BLEU [%] (30–50) for the 2005, 2006 and 2007 systems from IRST, LIMSI, RWTH and UPC, under the same three input conditions: text (FTE), verbatim transcripts, and ASR output.]
Improvement of ASR performance (En, Es, Zh)
[Line chart: WER/CER (%) (0–45) from 2004 to 2007 for English (EPPS), Spanish (EPPS), Spanish (CORTES) and Chinese (BN).]
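The WER plotted above is the word-level Levenshtein distance (substitutions + insertions + deletions) divided by the reference length; for Chinese the same computation runs over characters (CER). A minimal sketch:

```python
def wer(hypothesis: str, reference: str) -> float:
    """Word error rate: (substitutions + insertions + deletions) / reference length."""
    hyp, ref = hypothesis.split(), reference.split()
    # Single-row dynamic-programming edit distance over words.
    dist = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        prev_diag, dist[0] = dist[0], i
        for j, h in enumerate(hyp, start=1):
            prev_diag, dist[j] = dist[j], min(
                dist[j] + 1,           # deletion of reference word r
                dist[j - 1] + 1,       # insertion of hypothesis word h
                prev_diag + (r != h),  # substitution (free if words match)
            )
    return dist[len(hyp)] / len(ref) if ref else 0.0

print(f"{wer('the cat sat', 'the cat sat on the mat'):.2f}")  # 3 deletions / 6 words = 0.50
```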
Human evaluation of translations … En→Es adequacy (1–5)
[Bar chart: adequacy scores on a 1–5 scale for the FTE, verbatim and ASR input conditions, including system combinations and a commercial system.]
End-to-End
• The end-to-end evaluation is carried out for one translation direction: English-to-Spanish
• Evaluation of the ASR (ROVER) + SLT (ROVER) + TTS (UPC) system
• Same segments as for the SLT human evaluation
• Evaluation tasks:
  – Adequacy: comprehension test
  – Fluency: judgement test with several questions related to fluency and also usability of the system
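The end-to-end system is a straight pipeline: each component's output feeds the next, and judges rate only the final audio. A schematic sketch of that composition (the three stage functions are hypothetical stubs, not the actual TC-STAR components):

```python
def recognize(audio: bytes) -> str:        # ASR stage (e.g. a combined system)
    return "hello members of parliament"   # stub transcript

def translate(text: str) -> str:           # SLT stage
    return "hola miembros del parlamento"  # stub translation

def synthesize(text: str) -> bytes:        # TTS stage
    return text.encode("utf-8")            # stub waveform

def end_to_end(audio: bytes) -> bytes:
    """English audio in, Spanish audio out; only the final output is judged."""
    return synthesize(translate(recognize(audio)))

print(end_to_end(b"..."))
```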
Fluency questionnaire
• [Understanding] Do you think that you have understood the message? 1: Not at all … 5: Yes, absolutely
• [Fluent Speech] Is the speech in good Spanish? 1: No, it is very bad … 5: Yes, it is perfect
• [Effort] Rate the listening effort. 1: Very high … 5: Low, as natural speech
• [Overall Quality] Rate the overall quality of this audio sample. 1: Very bad, unusable … 5: It is very useful
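Judgments like these are typically averaged per question, over all judges and samples, into a mean score on the 1–5 scale. A small illustrative sketch (the data and names are made up, not TC-STAR results):

```python
from statistics import mean

# Each judgment: (question, score on the 1-5 scale above).
judgments = [
    ("Understanding", 4), ("Understanding", 3), ("Understanding", 5),
    ("Fluent Speech", 3), ("Fluent Speech", 4),
    ("Effort", 2), ("Effort", 3),
    ("Overall Quality", 3), ("Overall Quality", 4),
]

def mean_opinion_scores(judgments):
    """Average the 1-5 ratings per question across judges and samples."""
    by_question = {}
    for question, score in judgments:
        by_question.setdefault(question, []).append(score)
    return {q: mean(scores) for q, scores in by_question.items()}

for question, mos in mean_opinion_scores(judgments).items():
    print(f"{question}: {mos:.2f}")
```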
End-to-end results (subjective test: 1…5)
[Bar chart: scores (1–5) for Understanding, Fluent Speech, Effort and Overall Quality, for ITP vs. the TC-STAR system.]
TC-STAR Tasks
• More results from the 2007 Campaign
http://www.tc-star.org/
Evaluation packages available
Some concluding remarks on technology evaluation
• It saves developers time and money
• It helps assess progress accurately
• It produces reusable evaluation packages
• It helps to identify areas where more R&D is needed