LAP: The CLARINO Language Analysis Portal
Emanuele Lapponi, Stephan Oepen, Arne Skjærholt, and Erik Velldal
University of Oslo, Department of Informatics
October 16, 2015
In this talk:
- Introduction & high-level goals
- Design & Implementation: Galaxy
- LAF as a model for tool interchange in LAP
- Tool integration & versioning with the LAP tree
- Reaching out to other research communities
- Current state of development and future plans
Introduction
LAP:
- A portal providing easy access to NLP tools
- Unlike other processing environments:
- LAP is not web-service based
- Tools run on a high-capacity compute cluster
- Annotation representation and interchange format
- Part of an ongoing PhD project that investigates how NLP can benefit SSH research
Galaxy
A web-app platform for accessing and configuring tools, organizing datasets and annotations, and sharing results.
Tool interchange format
Requirements:
- Stand-off
- Scalable in terms of coverage of linguistic information and data volume
- Granular: on-demand access to relevant annotations
Data model: Linguistic Annotation Framework [2]
- Stand-off: text regions linked to a graph that describes them
- Content agnostic
- Flexible structure
Implementation: MongoDB
- Records describing the structural LAF elements: Regions, Nodes and Edges
- Flexible data access
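A minimal sketch of what Region, Node and Edge records might look like, using plain Python dicts in place of a MongoDB collection so that it runs without a database. The fields id, anchors, link and label follow the example graph used in this talk; the "from" and "to" fields on edges, the direction of the edges, and all helper names are assumptions made for illustration, not LAP's actual schema.

```python
TEXT = "Sandy barks."

# Regions anchor character offsets in the primary text (stand-off).
regions = [
    {"id": "repp-r1", "anchors": "0 5"},
    {"id": "repp-r2", "anchors": "6 11"},
    {"id": "repp-r3", "anchors": "11 12"},
]

# Nodes carry labels; token nodes link to a region, POS nodes do not.
nodes = [
    {"id": "repp-n1", "link": "repp-r1", "label": "Sandy"},
    {"id": "repp-n2", "link": "repp-r2", "label": "barks"},
    {"id": "repp-n3", "link": "repp-r3", "label": "."},
    {"id": "hunpos-n1", "label": "NNP"},
    {"id": "hunpos-n2", "label": "VBZ"},
    {"id": "hunpos-n3", "label": "."},
]

# Assumption: each POS edge points from a tag node to the token it annotates.
edges = [
    {"id": "hunpos-e1", "from": "hunpos-n1", "to": "repp-n1"},
    {"id": "hunpos-e2", "from": "hunpos-n2", "to": "repp-n2"},
    {"id": "hunpos-e3", "from": "hunpos-n3", "to": "repp-n3"},
]


def region_text(region, text=TEXT):
    """Resolve a region's anchors to the substring of the primary text."""
    start, end = (int(offset) for offset in region["anchors"].split())
    return text[start:end]


def nodes_from_tool(tool):
    """Granular access: return only the nodes produced by one tool."""
    return [node for node in nodes if node["id"].startswith(tool + "-")]


def tagged_tokens():
    """Follow the POS edges to pair each token label with its tag label."""
    by_id = {node["id"]: node for node in nodes}
    return [(by_id[e["to"]]["label"], by_id[e["from"]]["label"]) for e in edges]
```

Because annotations live in separate records rather than inline in the text, a consumer can fetch a single layer (for example only the hunpos nodes) without loading the rest of the graph.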
[Figure: LAF graph for the text "Sandy barks.", built from Region, Node and Edge records in four layers]
Sentence: region punkt-r1 (anchors "0 13"); node punkt-n1 (link punkt-r1, label "Sandy barks.")
Tokens: regions repp-r1 (anchors "0 5"), repp-r2 (anchors "6 11"), repp-r3 (anchors "11 12"); nodes repp-n1 (link repp-r1, label "Sandy"), repp-n2 (link repp-r2, label "barks"), repp-n3 (link repp-r3, label ".")
POS: nodes hunpos-n1 (label "NNP"), hunpos-n2 (label "VBZ"), hunpos-n3 (label "."); edges hunpos-e1, hunpos-e2, hunpos-e3
Dependencies: nodes malt-n1 (label "nn"), malt-n2 (label "root"), malt-n3 (label "punct"); edges malt-e1, malt-e2, malt-e3, malt-e4, malt-e5
Tool integration and versioning: the LAP tree
A LAP tool is made of:
- binaries for the actual annotator (e.g. the B&N parser)
- a wrapper that communicates with MongoDB
Which means:
- Different programming languages
- Different virtual machines and interpreters
- Different versions
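The division of labour between annotator and wrapper can be sketched as follows. This is a hedged illustration, not LAP's code: an in-memory dict stands in for MongoDB, toy_tagger stands in for a real annotator binary (which a wrapper would more likely invoke as an external process), and every function and field name here is hypothetical.

```python
def toy_tagger(tokens):
    """Stand-in for an external annotator; the rules are deliberately crude."""
    tags = []
    for token in tokens:
        if not token.isalnum():
            tags.append(token)   # punctuation is tagged as itself
        elif token[0].isupper():
            tags.append("NNP")   # capitalised word: proper noun
        else:
            tags.append("VBZ")   # everything else: verb (toy rule)
    return tags


def run_wrapper(store, tool, annotate, input_tool):
    """Read input nodes from the store, run the annotator, write results back."""
    # Collect the inputs first so the list is not modified while iterating.
    inputs = [n for n in store["nodes"] if n["id"].startswith(input_tool + "-")]
    labels = annotate([n["label"] for n in inputs])
    for i, (source, label) in enumerate(zip(inputs, labels), start=1):
        node_id = f"{tool}-n{i}"
        store["nodes"].append({"id": node_id, "label": label})
        # Assumption: the edge points from the new annotation to its input node.
        store["edges"].append(
            {"id": f"{tool}-e{i}", "from": node_id, "to": source["id"]}
        )
    return store


store = {
    "nodes": [
        {"id": "repp-n1", "label": "Sandy"},
        {"id": "repp-n2", "label": "barks"},
        {"id": "repp-n3", "label": "."},
    ],
    "edges": [],
}
run_wrapper(store, "hunpos", toy_tagger, "repp")
```

Keeping all database traffic in the wrapper means the annotator itself only ever sees its native input and output, whatever language or runtime it was written for.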
The LAP Tree
- A version-controlled repository of the core LAP parts (i.e. those that transcend Galaxy and the OS)
- Easily relocatable
- Enables reproducibility of experiments performed with historical versions of tools
Reaching out to other research communities
Our position:
- Start out with actual research questions
- Work jointly with SSH researchers
- Investigate how and to what degree this work can be generalized into workflows
So far:
- Joint work with political scientists
- Data-driven analysis of plenary debate speeches in the European Parliament [1]
Talk of Europe
- A project that aims at curating EP datasets as linked data
Our contribution:
- State-of-the-art syntacto-semantic annotations in RDF triples (and possible ontological means to connect them to the ToE graph)
Current state of development and future plans
LAP, currently:
- A Feide- and eduGAIN-accessible development instance
- HPC-ready tools for English, Sami and Norwegian
- Tabulated, CG3 and RDF export
- Basic user documentation
Short- to mid-term goals:
- Broaden the range of processing types (e.g. deep semantic parsing)
- Preprocessing tools for e.g. XML datasets
- Export interfaces with other CLARINO platforms such as Corpuscle and Glossa
References
[1] B. Høyland, J.-F. Godbout, E. Lapponi, and E. Velldal. Predicting party affiliations from European Parliament debates. In Proceedings of the 52nd Meeting of the Association for Computational Linguistics: Workshop on Language Technologies and Computational Social Science, pages 56–60, Baltimore, MD, USA, 2014.
[2] N. Ide and K. Suderman. The Linguistic Annotation Framework: A standard for annotation interchange and merging. Language Resources and Evaluation, (forthcoming), 2013.
[3] J. v. Zundert. If you build it, will we come? Large scale digital infrastructures as a dead end for digital humanities. Historical Social Research, 37(3), 2012.