CLARIN-PL

Language Technology for Polish in Practice

G4.19 Language Technology and Computational Linguistics Research Group and CLARIN-PL Research Infrastructure in brief

Maciej Piasecki, Marcin Pol, Tomasz Walkowiak
Wrocław University of Science and Technology

G4.19 Research Group

2017-01-17

G4.19 Research Group

§  Location
    §  Department of Computational Intelligence
    §  Faculty of Computer Science and Management
    §  Wrocław University of Science and Technology

§  Subgroups
    §  Computational Semantics
    §  Information Extraction
    §  Language Technology
    §  Corpus Linguistics
    §  Polish Lexicography
    §  Polish-English Lexicography
    §  Sentiment and Emotional Resources

§  Staff and permanent collaborators: 35

§  http://nlp.pwr.edu.pl

Samsung R&D Institute

Invited Lecture, 2017-01-17

CLARIN-PL

CLARIN Support for Humanities & Social Sciences

§  CLARIN is an ERIC-type consortium of
    §  19 members: Austria, Bulgaria, Czech Republic, Denmark, Estonia, Finland, Germany, Greece, Hungary, Italy, Latvia, Lithuania, The Netherlands, Norway, Poland, Portugal, Slovenia, Sweden and the Dutch Language Union
    §  Poland is one of the founding members
    §  1 observer: United Kingdom
§  Focus area: supporting research in Humanities and Social Sciences
§  CLARIN Mission
    §  To significantly lower the barriers to the use of Language Technology in Humanities & Social Sciences (H&SS)
    §  To facilitate or enable research methods based on automated analysis of text and speech resources

http://www.clarin.eu

Basic Notions

§  Language Technology (LT)
    §  language resources and tools
    §  robust in terms of quality and coverage
    §  multipurpose
    §  component based
§  Language Technology Infrastructure
    §  a software framework (architecture or platform)
    §  for combining language tools with language resources into processing chains (or pipelines)
    §  the defined processing chains are then applied to language data sources
    §  interoperability, also with external systems
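The notion of a processing chain can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy tools `split_sentences` and `tokenize` and the helper `make_chain` are hypothetical stand-ins and are not CLARIN-PL components.

```python
import re
from functools import reduce
from typing import Callable, List

# Hypothetical toy tools; real language tools are far more robust.
def split_sentences(text: str) -> List[str]:
    """Naively split text into sentences after '.', '!' or '?'."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def tokenize(sentences: List[str]) -> List[List[str]]:
    """Split each sentence into word and punctuation tokens."""
    return [re.findall(r"\w+|[^\w\s]", s) for s in sentences]

def make_chain(*tools: Callable) -> Callable:
    """Compose tools so that each one consumes the previous one's output."""
    return lambda data: reduce(lambda acc, tool: tool(acc), tools, data)

# A chain is defined once and can then be applied to any data source.
chain = make_chain(split_sentences, tokenize)
result = chain("Ala ma kota. Kot ma Alę!")
print(result)
```

Because every tool only needs to agree with its neighbours on the data format, further tools could be appended to the chain without changing the existing ones.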

LT in Humanities and Social Sciences: Problems

§  Limited usage of LT in Humanities and Social Sciences
    §  hard to find: dispersed across the Web, poorly described and in technical language
    §  a variety of technological solutions, users' computers with insufficient resources
    §  programming skills or knowledge of natural language engineering are required
§  LT Infrastructure for H&SS
    §  common standards, combined platforms, open approaches
    §  joint catalogues and search facilities
    §  focused on H&SS users
    §  Web Services and Web Applications: no need for installation, processing focused on H&SS research tasks

LT in Humanities and Social Sciences: Barriers

§  Physical: language tools and resources are not accessible on the Internet

§  Informational: descriptions are not available or there is no means of searching them

§  Technological: lack of commonly accepted standards for LT, lack of a common platform, a variety of technological solutions, users' computers with insufficient resources

§  Related to knowledge: the use of LT requires programming skills or knowledge of natural language engineering

§  Legal: licences for language resources and tools (LRTs) limit their applications

CLARIN Offer

§  Integration of different LT components into one interoperable system
§  Common, flexible meta-data standard (CMDI)
§  Central searching for resources (Virtual Language Observatory)
§  Single sign-on: one login for the whole distributed infrastructure
§  Common standards: promoting, co-ordinating, harmonising
§  Web Services for Language Tools and Resources
§  Installation-free Web Applications for research tasks
§  Common licences and promotion of open access

CLARIN: Central Services https://www.clarin.eu/

CLARIN-PL: the Consortium

§  Consortium
    §  Wrocław University of Science and Technology, Computational Intelligence Department, G4.19 Research Group
    §  Institute of Computer Science, Polish Academy of Sciences
    §  Polish-Japanese Institute of Information Technology, Chair of Multimedia
    §  University of Łódź, PELCRA group at the Chair of English Language and Applied Linguistics
    §  Institute of Slavic Studies, Polish Academy of Sciences
    §  Wrocław University
§  Goal:
    §  implementation of the Polish part of the CLARIN ERIC LTI
§  CLARIN-PL Language Technology Centre http://www.clarin-pl.eu

CLARIN-PL Development

§  Bi-directional development of LTI (Piasecki, 2014)
§  Bottom-up: development of the necessary basic elements of LTI
    §  a distributed network infrastructure
    §  basic LT processing chain
§  Top-down
    §  users' needs → web-based research applications
    §  close co-operation with key users from the H&SS domain
    §  amendments to the shape of the technical basis: LRTs, standards
    §  inspirations, identification of further user needs, next iterations …

CLARIN-PL: the Consortium

§  Polish scientific consortium
    §  Wrocław University of Science and Technology, G4.19 Research Group
    §  Institute of Computer Science, Polish Academy of Sciences
    §  Polish-Japanese Institute of Information Technology, Chair of Multimedia
    §  University of Łódź, PELCRA group at the Chair of English Language and Applied Linguistics
    §  Institute of Slavic Studies, Polish Academy of Sciences
    §  Wrocław University

§  Goal: implementation of the Polish part of the CLARIN ERIC LTI

§  Follows the bi-directional approach to LTI development

CLARIN-PL Language Technology Centre

§  Located at Wrocław University of Science and Technology
§  Modified DSpace system (LINDAT, Czech CLARIN)
§  One sign-on, one login (Pioneer.id)
§  Advanced repository system for language resources
    §  Persistent Identifiers for resources and tools
    §  CMDI meta-data standard (Virtual Language Observatory)
    §  Interface for Federated Content Search
    §  depositing service for researchers from H&SS
§  Web Services for LRTs:
    §  Basic processing chain for Polish
    §  Prototype system for flexible composition of natural language processing chains
    §  Support for developers: SOAP & REST interfaces
§  Web Applications for LRTs
§  Knowledge Sharing: expertise and support for the users of LT

CLARIN-PL: Language Resources

1.  Polish Morphological Dictionary
2.  Polish Speech Corpora
3.  Annotated Polish Corpora
4.  Bilingual Corpora
5.  Polish Historical Corpus
6.  Semantic lexicon
    §  plWordNet 3.0, a wordnet for Polish
    §  formal description of lexical meanings
7.  Dictionary of Multiword Expressions
8.  Bilingual semantic lexicon
9.  Lexicon of Proper Names
10. Syntactic-semantic Valency Dictionary (Walenty)
11. Robust syntactic-semantic grammar

Basic Language Tools for Polish

1.  Segmentation into tokens and sentences
2.  Morphological analysis
3.  Morphological guessing of unknown words (both without context and context sensitive)
4.  Morpho-syntactic tagging
5.  Word Sense Disambiguation
6.  Chunker and shallow syntactic parser
7.  Named Entity Recognition and disambiguation
8.  Co-reference and anaphora resolution
9.  Temporal expression recognition
10. Semantic relation recognition
11. Event recognition
12. Shallow semantic parser
13. Deep syntactic parser with disambiguated output: dependency and constituent
14. Deep semantic parser

Services - architecture

[Architecture diagram; labels: G4.19 Web applications; NLP Services (REST, SOAP); Server; NLP Engine; NLP Workers: Worker 1 (WCRFT2), Worker 2 (Liner2), Worker 3 (WSD), …, Worker n+1 (Serel); NFS; Monitoring; Internal network]

§  Efficiency
    §  Parallel processing (Walkowiak, 2015)
    §  Private cloud, scaling
    §  File identifiers on the inputs and outputs of tools
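The combination of parallel workers and file identifiers on tool inputs and outputs can be illustrated with a small Python sketch. Everything here is an assumption made for illustration: the in-memory `STORAGE` dictionary stands in for shared storage (NFS in the diagram), a thread pool stands in for the worker machines, and the toy tools are hypothetical, not the actual CLARIN-PL workers.

```python
import uuid
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

# Stand-in for shared storage: file identifier -> content.
STORAGE: Dict[str, str] = {}

def put_file(content: str) -> str:
    """Store content and return a fresh file identifier."""
    file_id = uuid.uuid4().hex
    STORAGE[file_id] = content
    return file_id

def run_tool(tool: Callable[[str], str], in_id: str) -> str:
    """Worker step: read input by identifier, store output, return its identifier."""
    return put_file(tool(STORAGE[in_id]))

def process(in_id: str, tools: List[Callable[[str], str]]) -> str:
    """Pass one document through the tools in order, exchanging only identifiers."""
    for tool in tools:
        in_id = run_tool(tool, in_id)
    return in_id

# Hypothetical toy tools standing in for real language tools.
tools = [str.lower, lambda text: " ".join(text.split())]

doc_ids = [put_file(t) for t in ["Ala  MA kota", "KOT ma   Alę"]]
with ThreadPoolExecutor(max_workers=2) as pool:
    # Documents are independent of each other, so they can run in parallel.
    out_ids = list(pool.map(lambda d: process(d, tools), doc_ids))

outputs = [STORAGE[i] for i in out_ids]
print(outputs)
```

Passing identifiers instead of document contents keeps the messages between components small, which matters when the documents themselves are large corpora.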

Web Services - choreography

§  Flexibility
    §  complex processing pipelines
    §  tools from the area of machine learning

[Diagram; labels: WCRFT, LINER2, SEREL (shown twice); SuperMatrix]

Services in LTC CLARIN-PL

§  Examples of implemented services
    §  Conversion: any2txt
    §  Language tools: wcrft2, chunker, chunkrel, serel, liner2, wosedon
    §  Extraction of feature vectors for texts: Fextor, FextorBis
    §  Text clustering and classification: stylo, cluto, SVM
    §  Communication (files, URLs, e-mails), integration with DSpace
§  Ongoing work
    §  Format converters, monitoring
    §  Applications for concrete research tasks
§  Possible linking of other tools
    §  Virtual machines + simple API
    §  Redirecting to external services (WebLicht, Multiservice)

Integrated environment

§  Repository is integrated with Language Tools

§  Simple corpus preprocessing for systems like Inforex

§  One user account for all tools and DSpace

[Diagram; labels: Processing pipeline (WS1, WS2, WS3); DSpace; API for Language Tools; Temporary data; Resources / data; Request from DSpace; Inforex; Prepared data]

Bi-directional - Top-down Part: First Applications

§  Approaching users
    §  already active, interested, working on large textual and speech resources, …
    §  covering a maximal variety of research areas, e.g. linguistics, literary studies, psychology, political studies and sociology
    §  matching the available language tools for Polish
    §  the first set of several prototype applications illustrating the possibilities and facilitating identification of the needs
§  First applications
    §  Spokes: searching corpora of conversational data
    §  A system for collecting Polish text corpora from the Web
    §  An open textometric and stylometric system focused on Polish
    §  Semantic text classification for sociology
    §  Literary Map

Spokes (University of Łódź) http://spokes.clarin-pl.eu

System for Collecting Polish Text Corpora from the Web

§  Arose from gaps in the available technology revealed by users
    §  existing corpus building systems were too sensitive to text encoding errors found on the Web
    §  not designed for informal corpora like blogs
§  A system for collecting Polish text corpora from the Web had to be constructed:
    §  based on tools from Masaryk University in Brno
    §  detects texts containing a large number of errors (by morphological analysis)
    §  supports semi-automated extraction of texts from blogs, forum posts, etc.
    §  integrated with tools for processing
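Detecting error-ridden texts by morphological analysis can be approximated as follows. This is a minimal sketch, assuming that "error" means a word the analyser does not recognise; the tiny `LEXICON` set is a hypothetical stand-in for a real morphological analyser, and the 0.3 threshold is an arbitrary illustrative value, not one taken from the system.

```python
import re
from typing import Iterable, List, Set

def unknown_ratio(text: str, lexicon: Set[str]) -> float:
    """Return the fraction of word tokens that the lexicon does not recognise."""
    words = re.findall(r"[^\W\d_]+", text.lower())  # runs of letters only
    if not words:
        return 1.0  # treat empty or word-free text as unusable
    unknown = sum(1 for w in words if w not in lexicon)
    return unknown / len(words)

def keep_clean(texts: Iterable[str], lexicon: Set[str],
               max_ratio: float = 0.3) -> List[str]:
    """Keep only texts whose share of unrecognised words is at most max_ratio."""
    return [t for t in texts if unknown_ratio(t, lexicon) <= max_ratio]

# Tiny hypothetical lexicon; a real system would query a morphological analyser.
LEXICON = {"ala", "ma", "kota", "kot", "psa", "i"}

texts = ["Ala ma kota i psa", "Al@ m? k0ta xx zz"]
kept = keep_clean(texts, LEXICON)
print(kept)
```

For informal sources such as blogs, the threshold would probably need to be more tolerant, since non-standard spellings are part of the material being collected.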

Open Textometric and Stylometric System

§  System designed for characteristic features of Polish
    §  like rich inflection, weakly constrained word order
§  Based on several existing components including Stylo (Eder & Rybicki)
§  Enabling the use of features defined on any level of the linguistic structure:
    §  from the level of word forms
    §  up to the level of semantic-pragmatic structures
§  Available as a Web Application and a Web Service
§  Stylometric techniques appear to be applicable in many tasks of H&SS
    §  sociology (features characteristic of different subgroups), political studies (similarities and differences between political parties), literary studies …
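A common stylometric baseline compares texts by the relative frequencies of their most frequent words. The sketch below shows that baseline only; it is not claimed to be the method used by the system described here, and all function names and the toy documents are hypothetical. The tokens could equally be word forms or lemmas, which matters for a richly inflected language.

```python
from collections import Counter
from typing import Dict, List

def most_frequent_words(docs: Dict[str, List[str]], n: int) -> List[str]:
    """The n most frequent words across all documents."""
    counts = Counter(tok for toks in docs.values() for tok in toks)
    return [w for w, _ in counts.most_common(n)]

def rel_freqs(tokens: List[str], vocab: List[str]) -> List[float]:
    """Relative frequency of each vocabulary word within one document."""
    counts = Counter(tokens)
    total = len(tokens) or 1
    return [counts[w] / total for w in vocab]

def manhattan(a: List[float], b: List[float]) -> float:
    """Sum of absolute differences between two frequency profiles."""
    return sum(abs(x - y) for x, y in zip(a, b))

# Toy documents built from Polish function words ("i", "w", "na").
docs = {
    "A": "i w na i w i".split(),
    "B": "i w na i w i".split(),
    "C": "na na na w w i".split(),
}
vocab = most_frequent_words(docs, 3)
profiles = {name: rel_freqs(toks, vocab) for name, toks in docs.items()}
d_ab = manhattan(profiles["A"], profiles["B"])
d_ac = manhattan(profiles["A"], profiles["C"])
print(vocab, d_ab, d_ac)
```

Documents A and B have identical profiles, so their distance is zero, while C uses the same words in different proportions and ends up further away.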

Semantic Text Classification for Sociology

§  Users: Collegium Civitas, Warsaw
§  Goal
    §  Support for large scale analysis of the source materials
    §  Automatically annotate documents and text fragments with pre-defined semantic categories
    §  Definition of categories by examples
    §  Automated semantic grouping of documents and text fragments
§  Support for
    §  Corpus building
    §  Manual annotation of the learning sub-corpus
    §  Automated annotation process
    §  Statistical analysis of the results
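Defining categories by examples can be illustrated with a very small classifier. This is a sketch under stated assumptions, not the system's implementation: each category is represented by the summed bag-of-words vector of its manually annotated examples, and a new text receives the category with the highest cosine similarity. The category names and example texts are invented for illustration.

```python
import math
from collections import Counter
from typing import Dict, List

def bag(text: str) -> Counter:
    """Bag-of-words vector over lowercased whitespace tokens."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity of two sparse count vectors (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def train(examples: Dict[str, List[str]]) -> Dict[str, Counter]:
    """Sum the example vectors of each category into one profile."""
    profiles = {}
    for category, texts in examples.items():
        profile = Counter()
        for t in texts:
            profile.update(bag(t))
        profiles[category] = profile
    return profiles

def classify(text: str, profiles: Dict[str, Counter]) -> str:
    """Return the category whose profile is most similar to the text."""
    vec = bag(text)
    return max(profiles, key=lambda c: cosine(vec, profiles[c]))

# Hypothetical annotated examples (the "learning sub-corpus").
examples = {
    "work": ["praca etat pensja", "szef praca urlop"],
    "family": ["rodzina dzieci dom", "matka ojciec dzieci"],
}
profiles = train(examples)
label = classify("urlop i pensja w pracy", profiles)
print(label)
```

Note that the inflected form "pracy" does not match "praca" in this sketch; this is one reason why lemmatisation with the basic language tools would normally precede classification of Polish text.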

Conclusions

§  Application of LT to research in Humanities & Social Sciences seems to be much more challenging than in commercial systems!

§  LT for Polish has reached a stage at which valuable support can be provided for research applications

§  The bi-directional approach combines
    §  development of the basic, universal set of language tools and resources
    §  with inspirations from the research applications

Bibliography

§  Piasecki, Maciej (2014) User-driven Language Technology Infrastructure – the Case of CLARIN-PL. In Proceedings of the Ninth Language Technologies Conference, Ljubljana, Slovenia, 2014. http://nl.ijs.si/isjt14/proceedings/isjt2014_01.pdf

§  Pęzik, Piotr (2015) Spokes – a search and exploration service for conversational corpus data. In Selected Papers from the CLARIN 2014 Conference, October 24-25, 2014, Soesterberg, The Netherlands, pp. 99-109, Linköping University Electronic Press, Linköpings universitet, ISBN: 978-91-7685-954-4. http://www.ep.liu.se/ecp/116/009/ecp15116009.pdf

§  Walkowiak, Tomasz (2015) Web based engine for processing and clustering of Polish texts. In Proceedings of the Tenth International Conference on Dependability and Complex Systems DepCoS-RELCOMEX 2015, Springer-Verlag.

CLARIN-PL

Thank you very much for your attention! www.clarin-pl.eu

Supported by the Polish Ministry of Science and Higher Education [CLARIN-PL]