Results of R&D: BLaRK for Dutch
Helmer Strik
Dept. of Linguistics
Centre for Language and Speech Technology (CLST)
Radboud University Nijmegen, the Netherlands
Cape Town, 24-11-2008
Introduction
Terminology:
• BLaRK: Basic Language Resources Kit
• BaTaVo: Basis Taal-Voorzieningen (the Dutch term for the basic language resources kit)
• Platform-BC: see this presentation
Period:
• 2000: plans
• 2002: results, future
NTU & Dutch HLT Platform
NTU - Nederlandse Taalunie (Dutch Language Union)
Mission: Strengthening the position of the Dutch language
Dutch HLT Platform
Aim: To contribute to the further development of an adequate language and speech technology infrastructure for Dutch
HLT platform Participants
Flanders:
• Ministry of the Flemish Community
• IWT (Flemish Institute for the Promotion of Scientific-technological Research in Industry)
• FWO (Fund for Scientific Research - Flanders)
Netherlands:
• Dutch Ministry of Education, Culture and Sciences
• Dutch Ministry of Economic Affairs
• Senter (agency of the Dutch Ministry of Economic Affairs)
• NWO (Netherlands Organisation for Scientific Research)
Objectives
• Strengthening the position of Dutch in HLT
• Establishing the proper conditions for successful management and maintenance of basic HLT resources developed through governmental funding
• Stimulating co-operation between academia and industry in the field of HLT
• Contributing to the realisation of European co-operation in HLT-relevant areas
• Establishing a network that brings together supply and demand for knowledge, products, and services
Action plan
An ‘Action plan for Dutch in language and speech technology’ was defined to achieve these objectives
Activities organised in four action lines (A, B, C, and D)
Dutch HLT Platform: Four action lines
A. Performing a market place function
B. Strengthening the HLT infrastructure
C. Working out standards and evaluation criteria
D. Developing a management, maintenance, and distribution plan
Action line A
• Encourage co-operation between industry, academia and policy institutions
• Raise awareness and give publicity to the results of HLT research
“Performing a market place function”
Action line B
• Defining the BLaRK (Basic Language Resources Kit) for Dutch
• Carrying out a field survey to determine what is needed to complete the BLaRK
• Drawing up a priority list with cost estimates, serving as policy guidelines
“Strengthening the digital language infrastructure”
Action line C
Drawing up standards and criteria for evaluation of basic materials in BLaRK and for assessment of project results
“Working out standards and evaluation criteria”
Action line D
Defining a Blueprint for management including intellectual property rights, maintenance, and distribution of HLT resources
“Developing a management, maintenance, and distribution plan”
Actions carried out
Conducted mailings to contacts (about 1000)
Contacted and visited companies with HLT related needs, to:
• Demonstrate the benefits of HLT
• Get a clear picture of each company’s knowledge status and future plans
• Provide information on cross-linking services
Organised seminars and workshops
Platform BC
A. Performing a market place function
B. Strengthening the HLT infrastructure
C. Working out standards and evaluation criteria
D. Developing a management, maintenance, and distribution plan
Action lines B + C: Platform BC
Platform BC: Who?
Steering committee:
• 8 HLT experts
• NTU
• NWO (funding body)
Field survey, 4 researchers:
• 2 language technology
• 2 speech technology
Platform BC: Who?
Steering committee: 8 HLT experts
              Lang. Tech.            Speech Tech.
Flanders      1. WD, 2. FvE          1. JPM, 2. DvC
Netherlands   1. GB, 2. AN/DH/FdJ    1. HS, 2. RV/AD
Platform BC: How?
Three stages:
1. Defining the BLaRK for Dutch
2. Making an inventory of HLT resources
3. Establishing a priority list
BLaRK: Basic Language Resources Kit
Components:
• Data: sets of language data and descriptions in machine-readable form
• Modules (or semi-products): the basic software components of HLT applications
• Applications: classes of applications rather than specific applications or products
2 matrices:
1. Modules x Data
2. Applications x Modules
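The two matrices can be pictured as nested lookup tables. The sketch below is only an illustration: the cell values are made up, and the rule for chaining the matrices (minimum along a path, maximum across modules) is an assumption of this example, not something stated in the slides. The 0/1/2 scale mirrors the blank, "+", "++" marks used in the presentation.

```python
from typing import Dict

# Importance scale as on the slides: 0 (blank), 1 ("+"), 2 ("++").
Matrix = Dict[str, Dict[str, int]]

# Hypothetical excerpt of the Modules x Data matrix (illustrative values only).
modules_x_data: Matrix = {
    "Lemmatising": {"Monolingual lexicon": 2, "Annotated corpus": 1},
    "Acoustic models": {"Speech corpus": 2},
}

# Hypothetical excerpt of the Applications x Modules matrix (illustrative values only).
applications_x_modules: Matrix = {
    "Information access": {"Lemmatising": 2},
    "Dialogue systems": {"Lemmatising": 1, "Acoustic models": 2},
}


def data_needed(application: str) -> Dict[str, int]:
    """Chain the two matrices to see which data an application class relies on.

    Assumed rule: a path application -> module -> data is as strong as its
    weakest link (min), and a data type keeps its strongest path (max).
    """
    needed: Dict[str, int] = {}
    for module, module_score in applications_x_modules.get(application, {}).items():
        for data, data_score in modules_x_data.get(module, {}).items():
            link = min(module_score, data_score)
            needed[data] = max(needed.get(data, 0), link)
    return needed


print(data_needed("Dialogue systems"))
```

Keeping the two matrices separate, as the BLaRK definition does, means a module's data needs are recorded once and reused for every application class that depends on that module.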
Matrices: Modules x Data and Applications x Modules
Importance quantified as 0, 1, or 2 (number of +'s), based on the field survey and expert opinions.
Data columns: monolingual lexicon, multilingual lexicon, thesauri, annotated corpora, unannotated corpora, speech corpora, multilingual corpora, multimodal corpora, multimedia corpora
Application columns: CALL, access control, speech input, speech output, dialogue systems, document production, information access, translation
Marks per module, left to right (empty cells not shown):
Language Technology
Grapheme-phoneme conv.: ++ ++ + ++ ++ + +
Token detection: ++ + ++ + + + + + +
Sent boundary detection: + ++ ++ + ++ ++ + ++ ++ ++
Name recognition: + + + ++ ++ ++ + ++ ++ + ++ ++ ++
Spelling correction: +
Lemmatising: ++ ++ + + + + + + + +
Morphological analysis: ++ ++ + + + ++ + ++ ++ ++
Morphological synthesis: ++ ++ + + ++ + ++ ++
Word sort disambig.: ++ ++ + + ++ + ++ ++ ++ ++
Parsers and grammars: ++ ++ + ++ ++ ++ ++ ++ ++
Shallow parsing: ++ ++ ++ + ++ ++ ++ ++ ++ ++
Constituent recognition: ++ ++ + + ++ ++ ++ ++ ++ ++
Semantic analysis: ++ ++ ++ ++ ++ + ++ ++ ++ ++ ++
Referent resolution: + ++ ++ + + ++ ++ ++ ++ ++
Word meaning disambig.: + ++ ++ + + ++ + + + ++ ++
Pragmatic analysis: + + ++ ++ ++ + ++ ++ ++ + ++
Text generation: ++ ++ ++ ++ ++ + ++ ++ ++ ++
Lang. dep. translation: ++ ++ ++ ++ + ++ ++
Speech Technology
Complete speech recog.: ++ + ++ + ++ + ++ ++ ++ ++ ++ ++ ++ ++ ++
Acoustic models: ++ + ++ + ++ + + + ++ + ++ ++ + + +
Language models: + ++ + + + + + ++ + ++ ++ ++ ++ ++
Pronunciation lexicon: ++ + + ++ + + + ++ + ++ + ++ + ++ ++
Robust speech recognition: + + + + + + ++ + + ++ ++ + + +
Non-native speech recog.: + ++ + ++ ++ + + ++ + + + + +
Speaker adaptation: + + + ++ + + ++ + + ++ + + ++ +
Lexicon adaptation: ++ + + ++ + + + ++ + ++ + ++ + ++ ++
Prosody recognition: + + ++ + ++ + + + ++ + ++ ++ ++ ++ ++
Complete speech synth.: ++ + + + + + ++ ++ + + ++
Allophone synthesis: + + + + + + + + + +
Di-phone synthesis: ++ + + + + + ++ ++ + + +
Unit selection: ++ + + + + + ++ ++ + + +
Prosody prediction for Text-to-Speech: ++ + + + + + ++ ++ ++ + ++
Aut. phon. transcription: ++ ++ + + ++ + + + ++ + + + + + + +
Aut. phon. segmentation: ++ ++ + + ++ + + + ++ + + + + + + +
Phoneme alignment: + + + ++ + + + ++ + + + +
Distance calc. phonemes: + + + ++ + + + ++ + + + +
Speaker identification: + ++ ++ ++ + ++ + + ++ + + + +
Speaker verification: + ++ ++ ++ + ++ + ++ + + + +
Speaker tracking: + ++ ++ ++ + ++ + + + + +
Language identification: + ++ + + ++ ++ + + + + + + + +
Dialect identification: + ++ + + ++ ++ + + + + + + + +
Confidence measures: + + + ++ + ++ + ++ ++ ++ ++ + + +
Utterance verification: + + + ++ + + + + + ++ ++ + + +
BLaRK: Language technology
• Modules
  • Robust modular text preprocessing
  • Morphological analysis and morphosyntactic disambiguation
  • Robust syntactic analysis
  • Aspects of semantic analysis (word meaning and reference)
• Data
  • Monolingual lexicon
  • Annotated corpus of written Dutch
  • Benchmarks for evaluation
BLaRK: Speech technology
• Modules
  • Automatic speech recognition
  • Speech synthesis system
  • Tools for annotation of speech corpora
  • Confidence measures and utterance verification
  • Identification (speaker, language, dialect)
• Data
  • Monolingual speech corpora for specific applications
  • Multilingual speech corpora
  • Multimodal/multimedia speech corpora
  • Benchmarks for evaluation
From BLaRK to priority lists
1. BLaRK: Basic Language Resources Kit
2. Inventory & Evaluation
3. Priority lists
BLaRK → inventory → priority
2. Inventory & Evaluation
Inventory: Which components in BLaRK are available?
• Bought
• Freely obtainable
• Reusable
• Of sufficient quality
Evaluation: And of sufficient quality?
• Checklist approach (vs. formal evaluation)
Modules Availability
Grapheme-phoneme conversion 8
Token detection 9
Sentence boundary detection 3
Name recognition 4
Spelling correction 3
Lemmatising 9
Morphological analysis 7
Morphological synthesis 9
Word sort disambiguation 7
Parsers and grammars 3
Shallow parsing 2
Constituent recognition 5
Semantic analysis 3
Referent resolution 2
Word meaning disambiguation 2
Pragmatic analysis 1
Text generation 3
Language dependent translation 3
Complete speech recognition 4
Acoustic models 8
Language models 3
Pronunciation lexicon 5
Robust speech recognition 2
Non-native speech recognition 2
Speaker adaptation 2
Lexicon adaptation 2
Prosody recognition 2
Complete speech synthesis 6
Allophone synthesis 7
Di-phone synthesis 6
Unit selection 1
Prosody prediction for Text-to-Speech 3
Autom. phonetic transcription 3
Autom. phonetic segmentation 5
Phoneme alignment 8
Distance calculation of phonemes 8
Speaker identification 2
Speaker verification 2
Speaker tracking 2
Language identification 2
Dialect identification 2
Confidence measures 2
Utterance verification 2
Data
Unannotated corpora 9
Annotated corpora 5
Speech corpora 4
Multi lingual corpora 3
Multi modal corpora 1
Multi media corpora 1
Test corpora 1
Monolingual lexicons 8
Multilingual lexicons 6
Thesaurus 4
Availability of modules and data is quantified on a scale of 1-10, based on the field survey and expert opinions.
3. Priority lists
The prioritisation was based on the following requirements:
The components should currently be unavailable, inaccessible, or of insufficient quality.
The components should be relevant for a large number of applications.
Developing the components should be possible in the short term.
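The three requirements can be read as a filter followed by a ranking. The sketch below is a hypothetical illustration, not the procedure the steering committee followed: the availability figures come from the inventory table, but the `relevant_apps` counts, the `short_term` flags, and the availability threshold of 4 are assumptions made up for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Component:
    name: str
    availability: int   # 1-10 score from the inventory table
    relevant_apps: int  # hypothetical: number of application classes that need it
    short_term: bool    # hypothetical: can it be developed in the short term?


def prioritise(components: List[Component], max_availability: int = 4) -> List[str]:
    """Keep poorly available, feasible components; rank the most widely relevant first."""
    candidates = [
        c for c in components
        if c.availability <= max_availability and c.short_term
    ]
    # Most relevant first; ties go to the less available component, then by name.
    candidates.sort(key=lambda c: (-c.relevant_apps, c.availability, c.name))
    return [c.name for c in candidates]


inventory = [
    Component("Token detection", availability=9, relevant_apps=6, short_term=True),
    Component("Shallow parsing", availability=2, relevant_apps=7, short_term=True),
    Component("Pragmatic analysis", availability=1, relevant_apps=5, short_term=False),
    Component("Language models", availability=3, relevant_apps=7, short_term=True),
]

print(prioritise(inventory))  # ['Shallow parsing', 'Language models']
```

In this toy run, token detection drops out because it is already widely available, and pragmatic analysis drops out because it is assumed not to be feasible in the short term.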
Consensus, broad support
Report version 1
Feedback from academia & industry:
• Sent to the Dutch-Flemish HLT field (1000 sites)
• Workshop 15/11/2001
Report version 2, final version
From BLaRK to priority lists
1. BLaRK, 2. Inventory & Eval., 3. Priority lists → Report 1
Feedback: HLT field, workshop
1. BLaRK, 2. Inventory & Eval., 3. Priority lists → Report 2
Report
Version 1:
Version 2, final version:
W. Daelemans & H. Strik (eds.) (2002). Het Nederlands in taal- en spraaktechnologie: prioriteiten voor basisvoorzieningen [Dutch in language and speech technology: priorities for basic resources]
Recommendations (1)
Regarding the BaTaVo:
• Collect existing components
• Complete the kit (stimulation, funding)
• Management & maintenance (action line D)
• Make available under an ‘open’ licence
• Evaluation: test corpora & methodology
Recommendations (2)
General:
• More language and speech technologists (education, training, projects)
• More co-operation
• In addition to funding for application-oriented research, also funding for fundamental research
Priority list: Language technology
1. Annotated corpus of written Dutch
2. Syntactic analysis
3. Robust text pre-processing
4. Semantic annotations for the treebank in item 1
5. Translation equivalents
6. Benchmarks for evaluation
Priority list: Speech technology
1. Automatic speech recognition
2. Speech corpora
3. Multi-media speech corpora
4. Tools for (semi-)automatic transcription of speech data
5. Speech synthesis
6. Benchmarks for evaluation
Future prospects [2002]
Action line A:
• Stimulate HLT in the Netherlands and Flanders
• Cooperation: industry, academia, etc.
Action lines B & C:
• Collect existing resources
• Ensure priorities are realized
Action line D:
• Implementation of recommendations in the Blueprint
When BLaRK is established...
• Intellectual property rights held by the NTU
• Actual management and maintenance of resources by an HLT agency, to be founded
• Maintenance of expertise by Dutch-Flemish steering committees and an HLT management committee, both to be founded
General conclusions [2002]
The goals have been achieved, so the preconditions for developing the materials in the BLaRK are now in place
This work, carried out in the Dutch-speaking area, can be useful to others starting similar activities:
• Part of the report has been translated into English
• Presentations & publications
• Other domains
• Other countries
http://lands.let.kun.nl/~strik/BLaRK.html
Questions?