LAP: The CLARINO Language Analysis Portal · I Export interfaces with other CLARINO platforms such...

26
LAP: The CLARINO Language Analysis Portal Emanuele Lapponi, Stephan Oepen, Arne Skjærholt, and Erik Velldal University of Oslo, Department of Informatics October 16, 2015

Transcript of LAP: The CLARINO Language Analysis Portal · I Export interfaces with other CLARINO platforms such...

LAP: The CLARINO Language Analysis Portal

Emanuele Lapponi, Stephan Oepen, Arne Skjærholt, and Erik Velldal

University of Oslo,Department of Informatics

October 16, 2015

In this talk:

I Introduction & high level goals

I Design & Implementation: Galaxy

I LAF as a model for tool interchange in LAP

I Tool integration & versioning with the LAP tree

I Reaching out to other research communities

I Current state of development and future plans

2

Introduction

LAP:I A portal providing easy access to NLP toolsI Unlike other processing environments:

I LAP is not web-service based

I Tools run on a high-capacity compute cluster

I Annotation representation and interchange format

I Part of an ongoing PhD project that investigateshow NLP can benefit SSH research

3

Introduction

LAP:I A portal providing easy access to NLP toolsI Unlike other processing environments:

I LAP is not web-service based

I Tools run on a high-capacity compute cluster

I Annotation representation and interchange format

I Part of an ongoing PhD project that investigateshow NLP can benefit SSH research

3

Galaxy

A web-app platform foraccessing and configuring

tools, organizing datasets andannotations, and share results.

4

Galaxy

A web-app platform foraccessing and configuring

tools, organizing datasets andannotations, and share results.

4

Galaxy

A web-app platform foraccessing and configuring

tools, organizing datasets andannotations, and share results.

4

Galaxy

A web-app platform foraccessing and configuring

tools, organizing datasets andannotations, and share results.

4

Galaxy

A web-app platform foraccessing and configuring

tools, organizing datasets andannotations, and share results.

4

Galaxy

A GUI to build (potentiallycomplex) workflows

4

Tool interchange format

Requirements:I Stand-off

I Scalable in terms of coverage of linguistic informationand data volume

I Granular: on-demand access of relevant annotations

5

Tool interchange format

Data model: Linguistic Annotation Framework [2]I Stand-off: text regions linked to a graph that describes them

I Content agnostic

I Flexible structure

6

Tool interchange format

Implementation: MongoDBI Records describing the structural LAF elements:

Regions, Nodes and Edges

I Flexible data-access

7

Tool interchange format

id: “repp-r2”anchors: “6 11”

id: “repp-r3”anchors: “11 12”

id: “repp-r1”anchors: “0 5”

id: “repp-n1”link: “repp-r1”label: “Sandy”

id: “

hunp

os-e

1”

Region Node Edge

id: “repp-n2”link: “repp-r2”label: “barks”

id: “repp-n3”link: “repp-r3”

label: “.”id

: “hu

npos

-e2”

id: “

hunp

os-e

3”

id: “hunpos-n1”label: “NNP”

id: “hunpos-n2”label: “VBZ”

id: “hunpos-n3”label: “.”

id: “punkt-r1”anchors: “0 13”

id: “punkt-n1”link: “punkt-r1”label: “Sandy

barks.”

id: “malt-n2”label: “root”

id: “malt-n1”label: “nn”

id: “malt-n3”label: “punct”

id: “

mal

t-e3”

id: “malt-e2”

id: “

mal

t-e1”

id: “malt-e5”

id: “

mal

t-e4”

Sentence

Tokens

POS

Dependencies

8

Tool interchange format

Not in competition with richer end-user formats!

9

Tool integration and versioning: the LAP tree

A LAP tool is made of:I binaries for the actual annotator (i.e. the B&N parser)

I a wrapper that communicates with MongoDB

Which means:I Different programming languages

I Different virtual machines and interpreters

I Different versions

10

Tool integration and versioning: the LAP tree

A LAP tool is made of:I binaries for the actual annotator (i.e. the B&N parser)

I a wrapper that communicates with MongoDB

Which means:I Different programming languages

I Different virtual machines and interpreters

I Different versions

10

Tool integration and versioning: the LAP tree

The LAP TreeI A version controlled repository of the core LAP parts

(i.e. those that transcend Galaxy and the OS)

I Easily relocatable

I Enables reproducibility of experiments performed withhistorical versions of tools

11

Reaching out to other research communities

If we build it, will they come? [3]

12

Reaching out to other research communities

Our position:I Start out with actual research questions

I Work jointly with SSH researchers

I Investigate how and to what degree this work can be generalizedinto workflows

So far:I Joint work with Political Scientists

I Data-driven analysis of plenary debate speeches in the EuropeanParliament [1]

13

Reaching out to other research communities

Our position:I Start out with actual research questions

I Work jointly with SSH researchers

I Investigate how and to what degree this work can be generalizedinto workflows

So far:I Joint work with Political Scientists

I Data-driven analysis of plenary debate speeches in the EuropeanParliament [1]

13

Reaching out to other research communities

Talk of EuropeI A project that aims at curating EP datasets to linked data

Our contribution:I State-of-the-art syntacto-semantic annotations in rdf triples

(and possible ontological means to connect them to the ToE graph)

14

Current state of development and future plans

LAP, currently:I A feide- and eduGAIN-accessible development instance

I HPC-ready tools for English, Sami and Norwegian

I Tabulated, cg3 and rdf export

I Basic user documentation

15

Current state of development and future plans

Short- to mid-term goals:I Broaden the range of processing types (e.g. deep semantic parsing)

I Preprocessing tools for e.g. xml-datasets

I Export interfaces with other CLARINO platforms such as Corpuscleand Glossa

16

Thank you!

Thank you!https://lap.hpc.uio.no/

17

References

B. Høyland, J.-F. Godbout, E. Lapponi, and E. Velldal.Predicting party affiliations from European Parliament debates.In Proceedings of the 52nd Meeting of the Association forComputational Linguistics: Workshop on Language Technologiesand Computational Social Science, page 56 – 60, Baltimore, MD,USA, 2014.

N. Ide and K. Suderman.The Linguistic Annotation Framework: A standard for annotationinterchange and merging.Language Resources and Evaluation, (forthcoming), 2013.

J. v. Zundert.If you build it, will we come? Large scale digital infrastructures as adead end for digital humanities.Historical Social Research, 37(3), 2012.

18