Results of R&D: BLaRK for Dutch Helmer Strik Dept. of Linguistics Centre for Language and Speech Technology (CLST) Radboud University Nijmegen, the Netherlands Radboud University Nijmegen


Results of R&D: BLaRK for Dutch

Helmer Strik
Dept. of Linguistics
Centre for Language and Speech Technology (CLST)
Radboud University Nijmegen, the Netherlands

Radboud University Nijmegen


Radboud University Nijmegen, Cape Town, 24-11-2008

Introduction

Terminology:
• BLaRK: Basic Language Resources Kit
• BaTaVo: Basis Taal-Voorzieningen
• Platform-BC: see this presentation

Period:
• 2000 – : plans
• – 2002 : results, future


NTU & Dutch HLT Platform

NTU - Nederlandse Taalunie (Dutch Language Union)
Mission: Strengthening the position of the Dutch language

Dutch HLT Platform
Aim: To contribute to the further development of an adequate language and speech technology infrastructure for Dutch


HLT platform Participants

Flanders:
• Ministry of the Flemish Community
• IWT (Flemish Institute for the Promotion of Scientific-technological Research in Industry)
• FWO (Fund for Scientific Research - Flanders)

Netherlands:
• Dutch Ministry of Education, Culture and Sciences
• Dutch Ministry of Economic Affairs
• Senter (agency of Dutch Ministry of Economic Affairs)
• NWO (Netherlands Organisation for Scientific Research)


Objectives

• Strengthening the position of Dutch in HLT
• Establishing the proper conditions for successful management and maintenance of basic HLT resources developed through governmental funding
• Stimulating co-operation between academia and industry in the field of HLT
• Contributing to the realisation of European co-operation in HLT-relevant areas
• Establishing a network that brings together supply and demand for knowledge, products, and services


Action plan

‘Action plan for Dutch in language and speech technology’ was defined to achieve objectives

Activities organised in four action lines (A, B, C, and D)


Dutch HLT Platform: Four action lines

A. Performing a market place function
B. Strengthening the HLT infrastructure
C. Working out standards and evaluation criteria
D. Developing a management, maintenance, and distribution plan


Action line A

• Encourage co-operation between industry, academia and policy institutions
• Raise awareness and give publicity to the results of HLT research

“Performing a market place function”


Action line B

• Defining the BLaRK (Basic Language Resources Kit) for Dutch
• Carrying out a survey to determine what is needed to complete the BLaRK: field survey
• Drawing up a priority list with cost estimates serving as policy guidelines

“Strengthening the digital language infrastructure”


Action line C

Drawing up standards and criteria for evaluation of basic materials in BLaRK and for assessment of project results

“Working out standards and evaluation criteria”


Action line D

Defining a Blueprint for management including intellectual property rights, maintenance, and distribution of HLT resources

“Developing a management, maintenance, and distribution plan”


Actions carried out

Conducted mailings to contacts (about 1000)

Contacted and visited companies with HLT related needs, to:

• Demonstrate benefits of HLT
• Get clear picture of company’s knowledge status and future plans
• Provide information on cross-linking services

Organised seminars and workshops


Platform BC

A. Performing a market place function
B. Strengthening the HLT infrastructure
C. Working out standards and evaluation criteria
D. Developing a management, maintenance, and distribution plan

B + C: Platform BC


Platform BC: Who?

Steering committee:
• 8 HLT experts
• NTU
• NWO (funding body)

Field survey, 4 researchers:
• 2 language technology
• 2 speech technology


Platform BC: Who?

Steering committee: 8 HLT experts

              Lang. Tech.       Speech Tech.
Flanders      1. WD             1. JPM
              2. FvE            2. DvC
Netherlands   1. GB             1. HS
              2. AN/DH/FdJ      2. RV / AD


Platform BC: How?

Three stages:
1. Defining the BLaRK for Dutch
2. Making inventory of HLT resources
3. Establishing priority list


BLaRK: Basic Language Resources Kit

Components:
• Data: sets of language data and descriptions in machine readable form
• Modules (or semi-products): the basic software components of HLT applications
• Applications: classes of applications rather than specific applications or products

2 matrices:
1. Modules x Data
2. Applications x Modules


BLaRK matrix: Modules (rows) x Data and Applications (columns)

Columns, Data: monolingual lexicons, multilingual lexicons, thesauri, annotated corpora, unannotated corpora, speech corpora, multilingual corpora, multimodal corpora, multimedia corpora
Columns, Applications: CALL, access control, speech input, speech output, dialogue systems, document production, information access, translation

Importance is quantified as 0, 1, or 2 (number of +'s), based on the field survey & expert opinions.

[The marks below are listed per module in left-to-right column order. Empty cells are not preserved in this transcript, so individual marks cannot be tied to a specific column.]

Language Technology
Grapheme-phoneme conv.                   ++ ++ + ++ ++ + +
Token detection                          ++ + ++ + + + + + +
Sent boundary detection                  + ++ ++ + ++ ++ + ++ ++ ++
Name recognition                         + + + ++ ++ ++ + ++ ++ + ++ ++ ++
Spelling correction                      +
Lemmatising                              ++ ++ + + + + + + + +
Morphological analysis                   ++ ++ + + + ++ + ++ ++ ++
Morphological synthesis                  ++ ++ + + ++ + ++ ++
Word sort disambig.                      ++ ++ + + ++ + ++ ++ ++ ++
Parsers and grammars                     ++ ++ + ++ ++ ++ ++ ++ ++
Shallow parsing                          ++ ++ ++ + ++ ++ ++ ++ ++ ++
Constituent recognition                  ++ ++ + + ++ ++ ++ ++ ++ ++
Semantic analysis                        ++ ++ ++ ++ ++ + ++ ++ ++ ++ ++
Referent resolution                      + ++ ++ + + ++ ++ ++ ++ ++
Word meaning disambig.                   + ++ ++ + + ++ + + + ++ ++
Pragmatic analysis                       + + ++ ++ ++ + ++ ++ ++ + ++
Text generation                          ++ ++ ++ ++ ++ + ++ ++ ++ ++
Lang. dep. translation                   ++ ++ ++ ++ + ++ ++

Speech Technology
Complete speech recog.                   ++ + ++ + ++ + ++ ++ ++ ++ ++ ++ ++ ++ ++
Acoustic models                          ++ + ++ + ++ + + + ++ + ++ ++ + + +
Language models                          + ++ + + + + + ++ + ++ ++ ++ ++ ++
Pronunciation lexicon                    ++ + + ++ + + + ++ + ++ + ++ + ++ ++
Robust speech recognition                + + + + + + ++ + + ++ ++ + + +
Non-native speech recog.                 + ++ + ++ ++ + + ++ + + + + +
Speaker adaptation                       + + + ++ + + ++ + + ++ + + ++ +
Lexicon adaptation                       ++ + + ++ + + + ++ + ++ + ++ + ++ ++
Prosody recognition                      + + ++ + ++ + + + ++ + ++ ++ ++ ++ ++
Complete speech synth.                   ++ + + + + + ++ ++ + + ++
Allophone synthesis                      + + + + + + + + + +
Di-phone synthesis                       ++ + + + + + ++ ++ + + +
Unit selection                           ++ + + + + + ++ ++ + + +
Prosody prediction for Text-to-Speech    ++ + + + + + ++ ++ ++ + ++
Aut. phon. transcription                 ++ ++ + + ++ + + + ++ + + + + + + +
Aut. phon. segmentation                  ++ ++ + + ++ + + + ++ + + + + + + +
Phoneme alignment                        + + + ++ + + + ++ + + + +
Distance calc. phonemes                  + + + ++ + + + ++ + + + +
Speaker identification                   + ++ ++ ++ + ++ + + ++ + + + +
Speaker verification                     + ++ ++ ++ + ++ + ++ + + + +
Speaker tracking                         + ++ ++ ++ + ++ + + + + +
Language identification                  + ++ + + ++ ++ + + + + + + + +
Dialect identification                   + ++ + + ++ ++ + + + + + + + +
Confidence measures                      + + + ++ + ++ + ++ ++ ++ ++ + + +
Utterance verification                   + + + ++ + + + + + ++ ++ + + +
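As a rough illustration of how the two BLaRK matrices (Modules x Data and Applications x Modules) with their 0/1/2 importance scores might be represented, here is a minimal Python sketch. The module, data, and application names come from the slides, but every score in it is a made-up placeholder, and the chaining rule in `data_needed_for` is an assumption for illustration only, not something the presentation describes.

```python
from typing import Dict

# Importance scale from the slides: 0 = no mark, 1 = "+", 2 = "++".
# NOTE: all scores below are hypothetical placeholders, not survey results.
Matrix = Dict[str, Dict[str, int]]

# Matrix 1: which data each module needs (Modules x Data).
modules_x_data: Matrix = {
    "grapheme-phoneme conversion": {"monolingual lexicon": 2, "speech corpus": 1},
    "acoustic models": {"speech corpus": 2},
}

# Matrix 2: which modules each application class needs (Applications x Modules).
applications_x_modules: Matrix = {
    "speech output": {"grapheme-phoneme conversion": 2},
    "dialogue systems": {"grapheme-phoneme conversion": 1, "acoustic models": 2},
}


def data_needed_for(application: str) -> Dict[str, int]:
    """Chain the two matrices (illustrative rule, an assumption):
    a link application -> module -> data is as strong as its weaker half,
    and each data type keeps the strongest link that reaches it."""
    needed: Dict[str, int] = {}
    for module, app_score in applications_x_modules.get(application, {}).items():
        for data, mod_score in modules_x_data.get(module, {}).items():
            link = min(app_score, mod_score)
            needed[data] = max(needed.get(data, 0), link)
    return needed


if __name__ == "__main__":
    print(data_needed_for("dialogue systems"))
```

Keeping cells with score 0 out of the dictionaries mirrors the empty cells of the matrix and keeps the structure sparse.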


BLaRK: Language technology

• Modules
  • Robust modular text preprocessing
  • Morphological analysis and morphosyntactic disambiguation
  • Robust syntactic analysis
  • Aspects of semantic analysis (word meaning and reference)

• Data
  • Monolingual lexicon
  • Annotated corpus of written Dutch
  • Benchmarks for evaluation


BLaRK: Speech technology

• Modules
  • Automatic speech recognition
  • Speech synthesis system
  • Tools for annotation of speech corpora
  • Confidence measures and utterance verification
  • Identification (speaker, language, dialect)

• Data
  • Monolingual speech corpora for specific applications
  • Multilingual speech corpora
  • Multimodal/medial speech corpora
  • Benchmarks for evaluation


From BLaRK to priority lists

1. BLaRK: Basic Language Resources Kit
2. Inventory & Evaluation
3. Priority lists


2. Inventory & Evaluation

Inventory: Which components in BLaRK are available?
• Bought
• Freely obtainable
• Reusable
• Of sufficient quality

Evaluation: And of sufficient quality?
• Checklist approach (vs. formal evaluation)


Modules Availability

Grapheme-phoneme conversion 8

Token detection 9

Sentence boundary detection 3

Name recognition 4

Spelling correction 3

Lemmatising 9

Morphological analysis 7

Morphological synthesis 9

Word sort disambiguation 7

Parsers and grammars 3

Shallow parsing 2

Constituent recognition 5

Semantic analysis 3

Referent resolution 2

Word meaning disambiguation 2

Pragmatic analysis 1

Text generation 3

Language dependent translation 3

Complete speech recognition 4

Acoustic models 8

Language models 3

Pronunciation lexicon 5

Robust speech recognition 2

Non-native speech recognition 2

Speaker adaptation 2

Lexicon adaptation 2

Prosody recognition 2

Complete speech synthesis 6

Allophone synthesis 7

Di-phone synthesis 6

Unit selection 1

Prosody prediction for Text-to-Speech 3

Autom. phonetic transcription 3

Autom. phonetic segmentation 5

Phoneme alignment 8

Distance calculation of phonemes 8

Speaker identification 2

Speaker verification 2

Speaker tracking 2

Language identification 2

Dialect identification 2

Confidence measures 2

Utterance verification 2

Data

Unannotated corpora 9

Annotated corpora 5

Speech corpora 4

Multi lingual corpora 3

Multi modal corpora 1

Multi media corpora 1

Test corpora 1

Monolingual lexicons 8

Multilingual lexicons 6

Thesaurus 4

Availability: quantified on a scale of 1-10
Basis: field survey & expert opinions


3. Priority lists

The prioritisation was based on the following requirements:

The components should currently be unavailable, inaccessible, or of insufficient quality.

The components should be relevant for a large number of applications.

Developing the components should be possible in the short term.
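The three requirements can be read as a filter followed by a ranking. The Python sketch below is one hypothetical way to express that; it is not the procedure used for the report. The availability values are taken from the inventory slide (1-10 scale), while the application counts, the short-term feasibility flags, and the cut-off `max_availability=4` are assumptions chosen only to make the example run.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Component:
    name: str
    availability: int    # 1-10 score from the inventory slide
    n_applications: int  # hypothetical: number of application classes that need it
    short_term: bool     # hypothetical: can it be developed in the short term?


def prioritise(components: List[Component], max_availability: int = 4) -> List[Component]:
    """Keep components that are poorly available and feasible in the short term,
    then rank them by how many applications they serve (ties: least available first).
    The threshold and the ranking rule are illustrative assumptions."""
    candidates = [
        c for c in components
        if c.availability <= max_availability and c.short_term
    ]
    return sorted(candidates, key=lambda c: (-c.n_applications, c.availability))


components = [
    Component("Token detection", 9, 10, True),      # already well available
    Component("Shallow parsing", 2, 10, True),
    Component("Pragmatic analysis", 1, 11, False),  # assumed not feasible short term
    Component("Speech corpora", 4, 8, True),
]

if __name__ == "__main__":
    for c in prioritise(components):
        print(c.name)
```

With these placeholder inputs, token detection drops out because it is already available and pragmatic analysis drops out on feasibility, which shows how each requirement acts independently.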


Consensus, broad support

Report version 1

Feedback from academia & industry:
• Sent to the Dutch-Flemish HLT field (1000 sites)
• Workshop 15/11/2001

Report version 2, final version


From BLaRK to priority lists

[Diagram: 1. BLaRK, 2. Inventory & Eval., 3. Priority lists → Report 1 → Feedback (HLT field, workshop) → 1. BLaRK, 2. Inventory & Eval., 3. Priority lists → Report 2]


Report

Version 1:

Version 2, final version:
W. Daelemans & H. Strik (eds.) (2002)
Het Nederlands in taal- en spraaktechnologie: prioriteiten voor basisvoorzieningen
(Dutch in language and speech technology: priorities for basic resources)


Recommendations (1)

Regarding the BaTaVo:
• Collect existing components
• Complete the kit (stimulation, funding)
• Management & maintenance (action line D)
• Make available, ‘open’ licence
• Evaluation: test corpora & methodology


Recommendations (2)

General:
• More language & speech technologists (education, training, projects)
• More co-operation
• Besides funding for application-oriented research, also funding for fundamental research


Priority list: Language technology

1. Annotated corpus of written Dutch
2. Syntactic analysis
3. Robust text pre-processing
4. Semantic annotations for treebank in 1
5. Translation equivalents
6. Benchmarks for evaluation


Priority list: Speech technology

1. Automatic speech recognition
2. Speech corpora
3. Multi-media speech corpora
4. Tools for (semi-)automatic transcription of speech data
5. Speech synthesis
6. Benchmarks for evaluation


Future prospects [2002]

Action line A:
• Stimulate HLT in The Netherlands and Flanders
• Cooperation: industry, academia, etc.

Action line B & C:
• Collect existing resources
• Ensure priorities are realized

Action line D:
• Implementation of recommendations in the Blueprint


When BLaRK is established...

• Intellectual rights by NTU
• Actual management and maintenance of resources by HLT agency, to be founded
• Maintenance of expertise by Dutch-Flemish steering committees and HLT management committee, both to be founded


General conclusions [2002]

Goals have been achieved so that the proper prior conditions for development of materials in BLaRK are created

This work, carried out in the Dutch speaking area, can be profitable for others when starting similar activities

• Part of the report is translated into English
• Presentations & publications
• Other domains
• Other countries

http://lands.let.kun.nl/~strik/BLaRK.html


Questions?
