NLP Interchange Format

www.sti-innsbruck.at © Copyright 2008 STI INNSBRUCK www.sti-innsbruck.at NLP Interchange Format José M. García


Transcript of NLP Interchange Format

Page 1: NLP Interchange Format

www.sti-innsbruck.at © Copyright 2008 STI INNSBRUCK www.sti-innsbruck.at

NLP Interchange Format

José M. García

Page 2: NLP Interchange Format


Outline

• What is NIF?
• Design requirements
• URI schemes
• NIF ontologies
• Use cases
• Relationship with ELRA
• Roadmap for NIF 2.0
• Conclusions


Page 3: NLP Interchange Format


What is NIF?

• Natural Language Processing Interchange Format

• NIF is an RDF/OWL-based format that aims to achieve interoperability between Natural Language Processing (NLP) tools, language resources and annotations.

• Building blocks
– URI scheme for identifying elements in texts
– Ontology for describing common NLP terms

• Created and maintained by the AKSW group at the University of Leipzig, during the LOD2 EU project.

• Community project: http://persistence.uni-leipzig.org/nlp2rdf/

Page 4: NLP Interchange Format


NIF design requirements

• Compatibility with RDF
• Coverage
• Structural Interoperability
• Conceptual Interoperability
• Granularity
• Provenance and Confidence
• Simplicity
• Scalability

Page 5: NLP Interchange Format


URI schemes

• Text needs to be referenceable by URIs

• With URI references text can be used as resources in RDF statements

• NIF distinguishes:
– Documents
– Text of the document
– Substrings of the text

• URI scheme is an algorithm to create IDs for text and substrings

• URI elements
– Document URI
– Separator
– Character indices
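A minimal sketch of such an ID-building algorithm, assuming Python and an illustrative `#char=` separator in the RFC 5147 style. The function names `nif_uri` and `context_uri` and the `example.org` document URI are hypothetical and not part of NIF; indices are assumed to count Unicode code points, starting at 0.

```python
def nif_uri(document_uri: str, begin: int, end: int, separator: str = "#char=") -> str:
    """Concatenate document URI, separator and character indices into one ID."""
    if not 0 <= begin <= end:
        raise ValueError("indices must satisfy 0 <= begin <= end")
    return f"{document_uri}{separator}{begin},{end}"


def context_uri(document_uri: str, text: str) -> str:
    """ID for the whole text of the document: indices 0 to len(text)."""
    return nif_uri(document_uri, 0, len(text))


# Hypothetical document URI and text, for illustration only.
text = "The Semantic Web"
whole = context_uri("http://example.org/doc", text)
word = nif_uri("http://example.org/doc", 4, 12)
```

Because the ID is derived only from the document URI and the offsets, two tools that annotate the same substring arrive at the same URI without coordinating.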

Page 8: NLP Interchange Format


RFC 5147

• Canonical URI scheme for NIF is based on RFC 5147

• It standardizes fragment identifiers for text/plain media type

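• Example URIs
– Document: http://www.w3.org/DesignIssues/LinkedData.html
– Text of the document (characters 0 to 26610): http://www.w3.org/DesignIssues/LinkedData.html#char=0,26610
– Substring (characters 1206 to 1218): http://www.w3.org/DesignIssues/LinkedData.html#char=1206,1218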
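To make the fragment syntax concrete, here is a hedged Python sketch that splits a `#char=begin,end` URI into its parts and resolves it against a text. It handles only this one form; RFC 5147 also defines other forms (for example `line=` and integrity checks) that are ignored here. The helper names `split_char_uri` and `resolve` and the `example.org` URI are hypothetical.

```python
import re
from typing import Tuple

# Only the "char=<begin>,<end>" form with both positions present is accepted.
_CHAR_FRAGMENT = re.compile(r"^char=(\d+),(\d+)$")


def split_char_uri(uri: str) -> Tuple[str, int, int]:
    """Return (document URI, begin, end) for a '#char=begin,end' URI."""
    document_uri, sep, fragment = uri.partition("#")
    match = _CHAR_FRAGMENT.match(fragment)
    if not sep or match is None:
        raise ValueError(f"not a char=begin,end fragment: {uri!r}")
    begin, end = int(match.group(1)), int(match.group(2))
    if begin > end:
        raise ValueError("begin must not exceed end")
    return document_uri, begin, end


def resolve(uri: str, text: str) -> str:
    """Return the substring of `text` that the URI points at."""
    _, begin, end = split_char_uri(uri)
    return text[begin:end]


parts = split_char_uri("http://www.w3.org/DesignIssues/LinkedData.html#char=1206,1218")
word = resolve("http://example.org/doc#char=4,12", "The Semantic Web")
```

Positions sit between characters, so `char=4,12` covers the eight characters from index 4 up to, but not including, index 12, which lines up with Python slicing.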

Page 9: NLP Interchange Format


NIF Core Ontology

• Classes and properties to describe relation between
– Documents
– Text
– Substrings
– Corresponding URI schemes
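As an illustration of how such classes and properties might tie text, substring and URI scheme together, the Python sketch below prints a small Turtle description. The term names (`nif:Context`, `nif:String`, `nif:RFC5147String`, `nif:isString`, `nif:referenceContext`, `nif:beginIndex`, `nif:endIndex`, `nif:anchorOf`) and the namespace are given as recalled from the NIF 2.0 Core ontology and should be checked against the published version. The `describe` helper and the `example.org` URI are hypothetical.

```python
import json

# Namespaces as assumed for this sketch; verify against the published ontology.
NIF = "http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#"
XSD = "http://www.w3.org/2001/XMLSchema#"


def literal(value: str) -> str:
    # Shortcut: json.dumps quoting, which is sufficient for simple strings.
    return json.dumps(value, ensure_ascii=False)


def describe(document_uri: str, text: str, begin: int, end: int) -> str:
    """Describe the whole text and one substring as Turtle."""
    context = f"{document_uri}#char=0,{len(text)}"
    substring = f"{document_uri}#char={begin},{end}"
    return "\n".join([
        f"@prefix nif: <{NIF}> .",
        f"@prefix xsd: <{XSD}> .",
        "",
        f"<{context}>",
        "    a nif:Context , nif:RFC5147String ;",
        f"    nif:isString {literal(text)} .",
        "",
        f"<{substring}>",
        "    a nif:String , nif:RFC5147String ;",
        f"    nif:referenceContext <{context}> ;",
        f'    nif:beginIndex "{begin}"^^xsd:nonNegativeInteger ;',
        f'    nif:endIndex "{end}"^^xsd:nonNegativeInteger ;',
        f"    nif:anchorOf {literal(text[begin:end])} .",
    ])


turtle = describe("http://example.org/doc", "The Semantic Web", 4, 12)
print(turtle)
```

The substring resource points back to the resource for the whole text, so a consumer can recover the surrounding text from any single annotation.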

Page 10: NLP Interchange Format


NIF Core Ontology

• Additional classes and properties (unstable/testing)

– More URI schemes

– Text structure (words, sentences, paragraphs…)

– Part of Speech (POS)

– Annotations with Stanbol

– Confidence

Page 11: NLP Interchange Format


Workflows, Modularity and Extensibility of NIF

• Workflows for NLP integration
– Normalization
– Tokenization
– Merge RDF annotations
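The three workflow steps can be sketched in Python as follows. This is an assumption-laden toy, not the NIF reference workflow: NFC is assumed as the normalization form, the tokenizer simply splits on whitespace, triples are plain string tuples rather than real RDF, and the `ex:label` predicate and second tool's output are hypothetical.

```python
import re
import unicodedata
from typing import Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject URI, predicate, object) as plain strings


def normalize(text: str) -> str:
    # Normalize first so that every tool counts characters over the same text.
    return unicodedata.normalize("NFC", text)


def tokenize(document_uri: str, text: str) -> List[Triple]:
    # Whitespace tokenizer; each token is identified by its character offsets.
    return [
        (f"{document_uri}#char={m.start()},{m.end()}", "nif:anchorOf", m.group())
        for m in re.finditer(r"\S+", text)
    ]


def merge(*outputs: Iterable[Triple]) -> Set[Triple]:
    # Merging is a set union: triples about the same substring share a subject URI.
    merged: Set[Triple] = set()
    for triples in outputs:
        merged.update(triples)
    return merged


doc = "http://example.org/doc"  # hypothetical document URI
text = normalize("Leipzig is nice")
tokens = tokenize(doc, text)
entities = [(f"{doc}#char=0,7", "ex:label", "Location")]  # hypothetical second tool
merged = merge(tokens, entities)
about_leipzig = {p for s, p, _ in merged if s == f"{doc}#char=0,7"}
```

Since both tools derive the same URI for "Leipzig", their annotations end up on one subject after the union, with no alignment step in between.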

Page 12: NLP Interchange Format


Workflows, Modularity and Extensibility of NIF

• NIF ontology logical modules
– Terminological model
– Inference model
– Validation model

• Vocabulary modules
– FISE
– ITS
– OLiA
– NERD
– …

Page 13: NLP Interchange Format


Workflows, Modularity and Extensibility of NIF

• Granularity profiles

Page 14: NLP Interchange Format


ITS Use Case

• The Internationalization Tag Set 2.0 is a W3C working draft that is becoming a Recommendation.

• ITS standardizes HTML and XML attributes which can be used to annotate nodes with processing information for language service providers (i18n, l10n)

• ITS 2.0 RDF ontology was developed using NIF, including a round-trip conversion algorithm from ITS to NIF.

• NIF is expected to receive wide adoption by translation & language service providers

• ITS 2.0 RDF ontology provides properties which can be used to provide best practices for NLP annotations.

Page 15: NLP Interchange Format


OLiA Use Case

• The Ontologies of Linguistic Annotation provide stable identifiers for morpho-syntactical annotation tag sets, so that NLP tools can use these ids for better interoperability.

• OLiA provides Annotation Models and a Reference Model, comprising more than 110 OWL ontologies for over 34 tag sets in 69 languages

• Features
– Documentation
– Flexible Granularity
– Language Independence

• NIF provides two properties
– nif:oliaLink (links a nif:String to an OLiA Annotation Model)
– nif:oliaCategory (links to the Reference Model)

Page 16: NLP Interchange Format


RDFaCE Use Case

• RDFa Content Editor is a rich text editor that supports WYSIWYM authoring including various views of the semantically enriched textual content.

• It combines results of different NLP APIs for automatic content annotation

– Problem: heterogeneous API access, URI generation and output data structures
– Solution: server-side proxy, hard-coded input and connection of each API

• NIF simplified the integration, adding an interoperability layer

Page 17: NLP Interchange Format


What is ELRA?

• European Language Resources Association

• http://www.elra.info

• Effort to make Language Resources (LR) available for language engineering and to evaluate language engineering technologies.

• LR marketplace

• Related organizations
– ELDA (ELRA’s operational body)
– LREC conferences

Page 18: NLP Interchange Format


What is ELRA?

Page 19: NLP Interchange Format


Relationship with NIF

• Different objectives

• Written LRs (esp. corpora) can be annotated with NIF for further interoperability and integration with NLP tools

• ADVANTAGE: Large test data collection to evaluate NLP tools

• DISADVANTAGE: Cost of LR (though there are free ones)

Page 20: NLP Interchange Format


Roadmap for NIF 2.0

• Release of NIF 1.0
– DONE (Nov 2011)

• Release of NIF 2.0 Draft
– CURRENT effort on solving pending issues
– Adoption in ITS 2.0 W3C (soon-to-be) Recommendation
– NIF-Core ontology is becoming stable
– RLOG - an RDF Logging Ontology
– NIF Validator software available

• Release of NIF 2.0 Core

• Release of NIF 2.0 Extensions
– ITS ontology, PROV ontology, Lemon Ontology, NERD, UIMA, MARL opinion ontology…

Page 21: NLP Interchange Format


Conclusions

• NIF makes it possible to integrate NLP tools using Linked Data

• Ongoing effort

• Many adopters and supporters
– LOD2 EU project
– Several W3C working groups
– Named Entity Recognition and Disambiguation (NERD)
– Ontologies of Linguistic Annotation (OLiA)
– …

• 27 different implementations and use cases
– Some available at http://persistence.uni-leipzig.org/nlp2rdf/

Page 22: NLP Interchange Format

www.sti-innsbruck.at © Copyright 2012 STI INNSBRUCK www.sti-innsbruck.at

Thanks for your attention

Questions?


Page 23: NLP Interchange Format


References

1. http://persistence.uni-leipzig.org/nlp2rdf/

2. Integrating NLP using Linked Data by Sebastian Hellmann, Jens Lehmann, Sören Auer, and Martin Brümmer in 12th International Semantic Web Conference, 21-25 October 2013, Sydney, Australia
