CoLiOS - Corpus Linguistic Open Source
![Page 1: CoLiOS - Corpus Linguistic Open Source](https://reader036.fdocuments.net/reader036/viewer/2022081413/5480190cb4af9ffb518b48c2/html5/thumbnails/1.jpg)
Alexandru-Lucian Gînscă¹, Adrian Iftene¹, Marius Corîci²
ConsILR Conference, 8-9 December, Bucharest, Romania, National Museum of Romanian Literature (MNLR)
¹ “Al. I. Cuza” University of Iași, Romania, Faculty of Computer Science
² Intelligentics, Cluj-Napoca, Romania
![Page 2: CoLiOS - Corpus Linguistic Open Source](https://reader036.fdocuments.net/reader036/viewer/2022081413/5480190cb4af9ffb518b48c2/html5/thumbnails/2.jpg)
◦ Motivation
◦ Existing Sentiment Corpora
◦ Files Sources
◦ Annotations
◦ Annotation Process
◦ Corpus Statistics
◦ Evaluation Metrics Proposal
◦ Conclusions
![Page 3: CoLiOS - Corpus Linguistic Open Source](https://reader036.fdocuments.net/reader036/viewer/2022081413/5480190cb4af9ffb518b48c2/html5/thumbnails/3.jpg)
Sentiment Analysis, or Opinion Mining, has been a hot topic for some time in the Web 2.0 era.
Building robust Sentiment Analysis systems requires resources for training and evaluating them.
Romanian lacks such a Sentiment Corpus.
We intend to make our corpus publicly available, free of charge, to individual researchers and research centers.
![Page 4: CoLiOS - Corpus Linguistic Open Source](https://reader036.fdocuments.net/reader036/viewer/2022081413/5480190cb4af9ffb518b48c2/html5/thumbnails/4.jpg)
Existing Sentiment Corpora: MPQA opinion corpus, Large Movie Review Dataset, SentiWordNet, The JDPA Sentiment Corpus, UMass Amherst Linguistics Sentiment Corpora
Languages: English, German, Italian, Chinese, Japanese
![Page 5: CoLiOS - Corpus Linguistic Open Source](https://reader036.fdocuments.net/reader036/viewer/2022081413/5480190cb4af9ffb518b48c2/html5/thumbnails/5.jpg)
Romanian online publications:
◦ Online newspapers (MediaFax, Romania Libera, etc.)
◦ Blogs (Chinezu.eu, Zoso.ro, etc.)
◦ News portals (Realitatea.net, StirileProTv.ro, etc.)
Category: Telecommunications
Companies: Orange, Vodafone, Cosmote and so on.
![Page 6: CoLiOS - Corpus Linguistic Open Source](https://reader036.fdocuments.net/reader036/viewer/2022081413/5480190cb4af9ffb518b48c2/html5/thumbnails/6.jpg)
<paragraph id=""></paragraph>
<sentimentGroup value="" id_group=""> </sentimentGroup>
-4 <= value <= 4
<entity type="" sentiment="" id_entity="" id_group=""></entity>
-4 <= value <= 4
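The range constraint above can be checked mechanically. The following is a minimal sketch, assuming that `sentimentGroup` and `entity` elements are nested inside `paragraph` and that the `sentiment` attribute of an entity shares the -4 to 4 range. The sample text, the ids and the function name `out_of_range` are hypothetical, not taken from the corpus.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the nesting, ids and text are assumptions for illustration.
SAMPLE = """
<paragraph id="1">
  <sentimentGroup value="3" id_group="g1">servicii excelente</sentimentGroup>
  oferite de
  <entity type="Company" sentiment="3" id_entity="e1" id_group="g1">Orange</entity>
</paragraph>
"""

MIN_SCORE, MAX_SCORE = -4, 4


def out_of_range(paragraph_xml):
    """Return (tag, attribute, raw value) for every score outside [-4, 4]."""
    root = ET.fromstring(paragraph_xml)
    problems = []
    for tag, attr in (("sentimentGroup", "value"), ("entity", "sentiment")):
        for node in root.iter(tag):
            raw = node.get(attr, "")
            try:
                ok = MIN_SCORE <= int(raw) <= MAX_SCORE
            except ValueError:
                # Empty or non-numeric scores are reported as problems too.
                ok = False
            if not ok:
                problems.append((tag, attr, raw))
    return problems


print(out_of_range(SAMPLE))  # []
```

An empty attribute, as in the template above, is reported as a problem, which may or may not match how unannotated elements are meant to be treated.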
![Page 7: CoLiOS - Corpus Linguistic Open Source](https://reader036.fdocuments.net/reader036/viewer/2022081413/5480190cb4af9ffb518b48c2/html5/thumbnails/7.jpg)
![Page 8: CoLiOS - Corpus Linguistic Open Source](https://reader036.fdocuments.net/reader036/viewer/2022081413/5480190cb4af9ffb518b48c2/html5/thumbnails/8.jpg)
Linking sentiment groups to entities
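The slide image is not part of this transcript, but the shared `id_group` attribute suggests how the link is encoded. The sketch below collects, for each sentiment group, the entities that point to it. It assumes that an entity without a link carries an empty `id_group`; the sample paragraph and the function name `link_groups_to_entities` are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample: ids, text and the empty id_group convention are assumptions.
SAMPLE = """
<paragraph id="1">
  <sentimentGroup value="-2" id_group="g1">semnal slab</sentimentGroup>
  <entity type="Company" sentiment="-2" id_entity="e1" id_group="g1">Vodafone</entity>
  <entity type="City" sentiment="0" id_entity="e2" id_group="">Cluj-Napoca</entity>
</paragraph>
"""


def link_groups_to_entities(paragraph_xml):
    """Map each id_group to the id_entity values that reference it."""
    root = ET.fromstring(paragraph_xml)
    group_ids = {g.get("id_group") for g in root.iter("sentimentGroup")}
    links = defaultdict(list)
    for entity in root.iter("entity"):
        gid = entity.get("id_group")
        # Skip entities with no link or with a link to an unknown group.
        if gid and gid in group_ids:
            links[gid].append(entity.get("id_entity"))
    return dict(links)


print(link_groups_to_entities(SAMPLE))  # {'g1': ['e1']}
```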
![Page 9: CoLiOS - Corpus Linguistic Open Source](https://reader036.fdocuments.net/reader036/viewer/2022081413/5480190cb4af9ffb518b48c2/html5/thumbnails/9.jpg)
We consider the following major categories: City, Organization, Company, Country and Person. Additionally, we consider categories such as Brand, Product and Publication.
For almost all major categories we consider subcategories:
◦ For Cities we consider Romanian, European, American and Other Cities.
◦ For Organizations we consider Parties, Faculties, Universities, Ministries, etc.
◦ For People we consider Sportsmen, Politicians, Males, Females, etc.
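The category scheme above could be held in a small lookup structure, for instance to validate the `type` attribute of an entity. This is only a sketch: the names `MAJOR_CATEGORIES`, `ADDITIONAL_CATEGORIES`, `SUBCATEGORIES` and both helper functions are hypothetical, and the subcategory lists are limited to the values named on the slide.

```python
MAJOR_CATEGORIES = ("City", "Organization", "Company", "Country", "Person")
ADDITIONAL_CATEGORIES = ("Brand", "Product", "Publication")

# Only the subcategories named on the slide. The Organization and Person
# lists end in "etc." there, so they are incomplete; subcategories of
# Company and Country are not listed at all.
SUBCATEGORIES = {
    "City": ("Romanian", "European", "American", "Other"),
    "Organization": ("Parties", "Faculties", "Universities", "Ministries"),
    "Person": ("Sportsmen", "Politicians", "Males", "Females"),
}


def is_known_category(name):
    """True if the name is one of the major or additional categories."""
    return name in MAJOR_CATEGORIES or name in ADDITIONAL_CATEGORIES


def known_subcategories(category):
    """Subcategories listed on the slide, or an empty tuple if none are listed."""
    return SUBCATEGORIES.get(category, ())


print(is_known_category("Brand"))   # True
print(known_subcategories("City"))  # ('Romanian', 'European', 'American', 'Other')
```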
![Page 10: CoLiOS - Corpus Linguistic Open Source](https://reader036.fdocuments.net/reader036/viewer/2022081413/5480190cb4af9ffb518b48c2/html5/thumbnails/10.jpg)
11 annotators (first-year master's students in computational linguistics at FII, UAIC)
As the annotation tool we chose Serna (http://www.syntext.com/products/serna/): open source, flexible, easy to use, intuitive
Method 1: process the chosen files with our tools and automatically add annotations for named entities and for sentiments
Method 2: process only at paragraph level
![Page 11: CoLiOS - Corpus Linguistic Open Source](https://reader036.fdocuments.net/reader036/viewer/2022081413/5480190cb4af9ffb518b48c2/html5/thumbnails/11.jpg)
![Page 12: CoLiOS - Corpus Linguistic Open Source](https://reader036.fdocuments.net/reader036/viewer/2022081413/5480190cb4af9ffb518b48c2/html5/thumbnails/12.jpg)
![Page 13: CoLiOS - Corpus Linguistic Open Source](https://reader036.fdocuments.net/reader036/viewer/2022081413/5480190cb4af9ffb518b48c2/html5/thumbnails/13.jpg)
![Page 14: CoLiOS - Corpus Linguistic Open Source](https://reader036.fdocuments.net/reader036/viewer/2022081413/5480190cb4af9ffb518b48c2/html5/thumbnails/14.jpg)
![Page 15: CoLiOS - Corpus Linguistic Open Source](https://reader036.fdocuments.net/reader036/viewer/2022081413/5480190cb4af9ffb518b48c2/html5/thumbnails/15.jpg)
◦ 11 annotators
◦ 1 week span
◦ 110 files
◦ 1988 paragraphs
◦ 2044 sentiment groups
◦ 4301 entities
◦ 1101 links between entities and sentiment groups
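A few per-file averages follow directly from these counts. The short sketch below derives them; the dictionary keys and the function name `per_file` are hypothetical, and the figure of files per annotator assumes the files were split evenly, which the slides do not state.

```python
# Counts taken from the statistics slide.
STATS = {
    "paragraphs": 1988,
    "sentiment_groups": 2044,
    "entities": 4301,
    "links": 1101,
}
FILES = 110
ANNOTATORS = 11


def per_file(stats, files):
    """Average count per file for each annotated item type."""
    return {name: count / files for name, count in stats.items()}


averages = per_file(STATS, FILES)
for name, value in averages.items():
    print(f"{name} per file: {value:.2f}")

# Assumes an even split of files among annotators.
print(f"files per annotator: {FILES / ANNOTATORS:.0f}")
```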
![Page 16: CoLiOS - Corpus Linguistic Open Source](https://reader036.fdocuments.net/reader036/viewer/2022081413/5480190cb4af9ffb518b48c2/html5/thumbnails/16.jpg)
![Page 17: CoLiOS - Corpus Linguistic Open Source](https://reader036.fdocuments.net/reader036/viewer/2022081413/5480190cb4af9ffb518b48c2/html5/thumbnails/17.jpg)
![Page 18: CoLiOS - Corpus Linguistic Open Source](https://reader036.fdocuments.net/reader036/viewer/2022081413/5480190cb4af9ffb518b48c2/html5/thumbnails/18.jpg)
Sentiment group precision
Precision for named entities and sentiment group links
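The formulas for these two metrics did not survive in the transcript. As a point of reference only, the sketch below uses the standard definition of precision (items the system reports that also appear in the gold file, divided by all items the system reports), which may differ from the authors' definition. The item encodings, a `(paragraph id, text)` pair for a sentiment group and an `(id_entity, id_group)` pair for a link, are assumptions.

```python
def precision(system_items, gold_items):
    """Share of system items that also occur in the gold annotation."""
    system, gold = set(system_items), set(gold_items)
    if not system:
        return 0.0  # convention chosen here for an empty system output
    return len(system & gold) / len(system)


# Hypothetical sentiment groups, identified by (paragraph id, text).
system_groups = {("p1", "servicii excelente"), ("p1", "semnal slab"), ("p2", "preturi mari")}
gold_groups = {("p1", "servicii excelente"), ("p1", "semnal slab"), ("p3", "oferta buna")}
print(f"{precision(system_groups, gold_groups):.2f}")  # 0.67

# Hypothetical links, identified by (id_entity, id_group).
system_links = {("e1", "g1"), ("e2", "g1")}
gold_links = {("e1", "g1")}
print(f"{precision(system_links, gold_links):.2f}")  # 0.50
```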
![Page 19: CoLiOS - Corpus Linguistic Open Source](https://reader036.fdocuments.net/reader036/viewer/2022081413/5480190cb4af9ffb518b48c2/html5/thumbnails/19.jpg)
Relaxed precision for sentiment group value
◦ CG = the set of correctly identified sentiment groups
◦ VF(SSG) = the value of the sentiment group as given by the system
◦ VG(SSG) = the value of the sentiment group from the gold file
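The formula itself is missing from the transcript, so the following is one plausible reading rather than the authors' definition: a group in CG counts as a match when VF(SSG) and VG(SSG) differ by at most a tolerance, and the score is the share of such groups in CG. The tolerance of 1, the choice of denominator, the dictionary encoding and the name `relaxed_precision` are all assumptions.

```python
def relaxed_precision(system_values, gold_values, tolerance=1):
    """Share of correctly identified groups whose value is within the tolerance.

    system_values and gold_values map a group id to its value (VF and VG).
    CG is approximated as the groups present in both mappings.
    """
    correct_groups = system_values.keys() & gold_values.keys()
    if not correct_groups:
        return 0.0
    close = sum(
        1
        for group in correct_groups
        if abs(system_values[group] - gold_values[group]) <= tolerance
    )
    return close / len(correct_groups)


# Hypothetical values on the -4..4 scale.
system_values = {"g1": 3, "g2": -1, "g3": 0}
gold_values = {"g1": 4, "g2": 2, "g3": 0}
print(f"{relaxed_precision(system_values, gold_values):.2f}")  # 0.67
```

With a tolerance of 0 the same function reduces to an exact-match score over CG.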
![Page 20: CoLiOS - Corpus Linguistic Open Source](https://reader036.fdocuments.net/reader036/viewer/2022081413/5480190cb4af9ffb518b48c2/html5/thumbnails/20.jpg)
Average deviation for sentiment group value
◦ CG = the set of correctly identified sentiment groups
◦ VF(SSG) = the value of the sentiment group as given by the system
◦ VG(SSG) = the value of the sentiment group from the gold file
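The formula is not preserved in the transcript. A natural reading of the definitions is the mean of |VF(SSG) - VG(SSG)| over all groups in CG, and the sketch below implements that reading as an assumption. The dictionary encoding and the name `average_deviation` are hypothetical.

```python
def average_deviation(system_values, gold_values):
    """Mean absolute difference between system and gold values over CG.

    system_values and gold_values map a group id to its value (VF and VG).
    CG is approximated as the groups present in both mappings.
    """
    correct_groups = system_values.keys() & gold_values.keys()
    if not correct_groups:
        return 0.0
    total = sum(abs(system_values[g] - gold_values[g]) for g in correct_groups)
    return total / len(correct_groups)


# Hypothetical values on the -4..4 scale.
system_values = {"g1": 3, "g2": -1, "g3": 0}
gold_values = {"g1": 4, "g2": 2, "g3": 0}
print(f"{average_deviation(system_values, gold_values):.2f}")  # 1.33
```

On the -4 to 4 scale this measure ranges from 0 (identical values) to 8 (opposite extremes).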
![Page 21: CoLiOS - Corpus Linguistic Open Source](https://reader036.fdocuments.net/reader036/viewer/2022081413/5480190cb4af9ffb518b48c2/html5/thumbnails/21.jpg)
The importance of a Corpus for Sentiment Analysis for Romanian.
The annotation format and methodology.
Comparison between our proposal and existing Sentiment Corpora.
![Page 22: CoLiOS - Corpus Linguistic Open Source](https://reader036.fdocuments.net/reader036/viewer/2022081413/5480190cb4af9ffb518b48c2/html5/thumbnails/22.jpg)