CLARIN: Common Language Resources and Technology Infrastructure for the Social Sciences and Humanities. Steven Krauwer, Utrecht institute of Linguistics UiL-OTS (NL). INFuture, Zagreb, Nov 7 2007


CLARIN: Common Language Resources and Technology Infrastructure for the Social Sciences and Humanities

Steven Krauwer
Utrecht institute of Linguistics UiL-OTS (NL)

INFuture, Zagreb Nov 7 2007


Overview

• Problem & Mission
• Some why-questions
• Approach
• How we work and who we are
• Why this talk
• Summing up


The problem

• Much data in digital archives is language based
• Many archives are only known to local insiders and are mostly unconnected
• Every archive has its own standards for storage and access, normally only simple retrieval of files (text, audio or video documents)
• Social sciences and humanities researchers are often not aware of the potential benefits of using language and speech technology tools, and these tools are hard to use for non-specialists


The CLARIN Mission

What:
• Create an infrastructure that makes language resources and technology (LRT) available to scholars of all disciplines, especially social sciences and humanities (SSH)

How:
• Unite existing digital archives into a federation of connected archives with unified web access
• Provide language and speech technology tools as web services operating on language data in archives
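The "How" part of the mission can be pictured with a small sketch. It is only an illustration under stated assumptions: the slides do not specify any API, so the classes `Archive` and `Federation`, the `search` and `apply` methods, the toy `token_count` tool and the sample records are all hypothetical. In-memory objects stand in for real archives and web services.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Record:
    archive: str      # name of the archive that holds the document
    identifier: str   # archive-local identifier
    language: str     # language code, e.g. "hr"
    text: str


class Archive:
    """Hypothetical in-memory stand-in for one digital archive."""

    def __init__(self, name: str, records: Iterable[tuple[str, str, str]]):
        self.name = name
        self._records = [Record(name, i, lang, text) for i, lang, text in records]

    def search(self, language: str) -> list[Record]:
        return [r for r in self._records if r.language == language]


class Federation:
    """Single access point that forwards one query to every member archive."""

    def __init__(self, archives: Iterable[Archive]):
        self._archives = list(archives)

    def search(self, language: str) -> list[Record]:
        hits: list[Record] = []
        for archive in self._archives:
            hits.extend(archive.search(language))
        return hits

    def apply(self, tool: Callable[[str], int], language: str) -> dict[str, int]:
        """Run a tool (standing in for a web service) on every matching record."""
        return {f"{r.archive}/{r.identifier}": tool(r.text)
                for r in self.search(language)}


def token_count(text: str) -> int:
    """Toy language tool: counts whitespace-separated tokens."""
    return len(text.split())


# Invented sample data, for illustration only.
federation = Federation([
    Archive("archive-a", [("doc1", "hr", "dobar dan svima"),
                          ("doc2", "nl", "goede dag")]),
    Archive("archive-b", [("doc7", "hr", "hvala lijepa")]),
])
results = federation.apply(token_count, "hr")
print(results)  # {'archive-a/doc1': 3, 'archive-b/doc7': 2}
```

The point of the sketch is that the user talks to one access point and never needs to know which archive holds which document, or how each archive stores it.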


Why a European infrastructure?

• too much fragmentation
• lack of coordination
• lack of visibility
• lack of interoperability
• lack of sustainability
• expertise exists but not in all countries
• language independent tools can be shared
• language dependent tools can often be ported
• most countries not able to bear the cost


Why now?

• Exponential growth of digital data
• Maturity of language and speech technology:
– allows for high speed processing
– allows for large volumes
– allows for new research questions
• Growing interest at EU level in research infrastructures (RI) for the ERA
• ESFRI RI Roadmap published in 2006 includes 34 proposals for RIs
• all of them will get EC funding for a 1-3 year preparatory phase


Overall plan for CLARIN

Preparatory phase 2008 – 2010:
• Put everything in place to get started for real
• Build prototype
• Budget in preparatory phase
– 4.1 M€ from EC
– ??? M€ from participating countries

Construction phase 2011 – 2015:
• Build and populate with tools and resources

Exploitation phase 2016 – …:
• CLARIN in full service

Overall budget 2008 – 2020: ca 200 M€


4-dimensional approach for the prep phase

• The technical dimension

• The language dimension

• The user dimension

• The governance and legal dimension


Technical

• Technical specification of the infrastructure
• Construction of a prototype
• Validation on rich variety of
– languages (>20)
– resources
– services
– based on existing resources and tools (i.e. not a digitization or tools creation project)
• Strong focus on interoperability standards
• Conversion of existing resources
• Encapsulation of existing tools


Strong sustainable centers


Languages

• Intention to cover all languages spoken or studied in participating countries

• Representational and descriptive standards should be adequate and validated for all languages

• Same minimal coverage of basic resources and tools for all languages is to be defined (and implemented if additional funds are available)


Language activities

• Survey of resources and tools, including:
– encoding and annotation data
– quality indicators
• agreeing on taxonomies and ontologies
• agreeing on common standards

Focus on
• integration of tools
• interoperability
• usage scenarios
• if possible creation of missing essential resources
• validating specifications and prototype


User

Users are SSH scholars
• Do WE know what they need?
• Do THEY know what they need?

Actions:
• analyze past and ongoing SSH projects
• user consultation
• launch typical example projects to show potential
• create expertise centers
• awareness actions


Governance, funding and legal issues

Agree on e.g.:
• Who is going to pay for the construction and exploitation of the infrastructure
• How will the costs be shared
• How will it be managed
• How will it be coordinated with national policies

Actions:
• Analyse best practice in funding and management of transnational projects
• Prepare agreement between (now) 22 countries about long term joint funding of CLARIN
• Set up IPR framework


How we work

• Most tasks executed in Working Groups
• WGs consist of project partners & other experts (CLARIN is open for contributions by others!)
• Some WGs do work (e.g. build prototype), others create consensus
• Participation by others essential as e.g. standards cannot be imposed by a small group
• Unfortunately no funding available for WG participation by others – only influence!


Who we are

• The CLARIN consortium has 32 partners from 22 EU and associated countries, including Croatia (FFZG)

• The CLARIN community has 92 members in 32 countries (Nov 07)

Leading partners are:
• Utrecht University (Steven Krauwer, coordinator)
• Max Planck Institute Nijmegen (Peter Wittenburg)
• Hungarian Academy of Sciences (Tamas Varadi)


National vs EC funding

EC funds managed by consortium, will pay for
– generic tasks (e.g. research, prototyping, coordination, dissemination)
– participation by a single national coordination point in every country (in HR: FFZG Zagreb)

National funds to be managed nationally, will pay for
– participation by other sites in the country
– taking care of own language and priorities (standards & validation, adaptation of tools & resources)
– carrying out example humanities projects
– (hopefully) participating in Working Groups


Why this talk?

• Invitation to join CLARIN:
– We need user involvement
– We need archives willing to join the federation
– We need experts for our centers of expertise
– We need example humanities projects for the preparatory phase


Summing up (1)

• CLARIN is about to embark on its 3 year Preparatory Phase project aimed at designing and building an LRT infrastructure for the SSH

• It can only work with support from the whole SSH community, both inside and outside the EU

• Please join us if you feel you can and want to contribute. We don’t pay you but don’t charge you either – it’s free!

Contact:
• http://www.clarin.eu, [email protected]
• or your national contact point


Summing up (2)

One day any SSH scholar should be able to ask without any difficulty:

• “List all uses of enthusiasm in 19th century English novels written by women”

• “Find all video clips of Tony Blair on BBC in 2007”

• “Summarize Le Monde of October 7th 2007 – in Croatian”
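To make the first question above more concrete, here is a minimal sketch of how such a request might break down into a metadata filter plus a keyword-in-context search. It assumes the federation can deliver texts with harmonized metadata. The `Novel` record, its fields, the `uses_of` function and the placeholder texts are all invented for illustration and are not part of CLARIN.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Novel:
    title: str
    year: int
    language: str
    author_gender: str
    text: str


def uses_of(word, novels, *, language, first_year, last_year,
            author_gender, width=20):
    """Return (title, context) pairs for each whole-word match of `word`."""
    pattern = re.compile(rf"\b{re.escape(word)}\b", re.IGNORECASE)
    hits = []
    for novel in novels:
        # Metadata filter: language, author gender, publication year.
        if novel.language != language or novel.author_gender != author_gender:
            continue
        if not first_year <= novel.year <= last_year:
            continue
        # Keyword in context: keep `width` characters on either side.
        for match in pattern.finditer(novel.text):
            start = max(0, match.start() - width)
            end = min(len(novel.text), match.end() + width)
            hits.append((novel.title, novel.text[start:end]))
    return hits


# Placeholder records with invented text, not quotations from real novels.
corpus = [
    Novel("Placeholder Novel A", 1847, "en", "female",
          "She spoke with enthusiasm of the journey."),
    Novel("Placeholder Novel B", 1851, "en", "male",
          "His enthusiasm faded quickly."),
    Novel("Placeholder Novel C", 1923, "en", "female",
          "Enthusiasm was out of fashion."),
]

hits = uses_of("enthusiasm", corpus, language="en",
               first_year=1801, last_year=1900, author_gender="female")
for title, context in hits:
    print(f"{title}: {context}")
```

The search itself is the easy part. The hard part, and the reason for an infrastructure, is that the metadata (century, language, author gender) has to be available in a comparable form across all participating archives.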