Sinmin Literature Review Presentation
![Page 1: Sinmin Literature Review Presentation](https://reader034.fdocuments.net/reader034/viewer/2022042716/55a9d63c1a28abf0788b45aa/html5/thumbnails/1.jpg)
SINMIN
CORPUS FOR SINHALA LANGUAGE
Literature Review
Upeksha W. D.
Wijayarathna D. G. C. D.
Siriwardena M. P.
Lasandun K. H. L.
Supervisors :
Dr. Chinthana Wimalasuriya
Prof. Gihan Dias
Mr. N. H. N. D. De Silva
![Page 2: Sinmin Literature Review Presentation](https://reader034.fdocuments.net/reader034/viewer/2022042716/55a9d63c1a28abf0788b45aa/html5/thumbnails/2.jpg)
Sinmin is a corpus for the Sinhala language that is
➢Continuously updated
➢Dynamic (scalable)
➢Broad in coverage (structured and unstructured language)
![Page 3: Sinmin Literature Review Presentation](https://reader034.fdocuments.net/reader034/viewer/2022042716/55a9d63c1a28abf0788b45aa/html5/thumbnails/3.jpg)
OUTLINE
● Literature Review
● Introduction to Corpus Linguistics and What Is a Corpus
● Usages of a corpus
● Existing Corpus Implementations
● Identifying Sinhala Sources and Crawling
● Data Storage and Information Retrieval from Corpus
● Information Visualization
● Extracting Linguistic Features
● Current Progress
![Page 4: Sinmin Literature Review Presentation](https://reader034.fdocuments.net/reader034/viewer/2022042716/55a9d63c1a28abf0788b45aa/html5/thumbnails/4.jpg)
INTRODUCTION TO CORPUS LINGUISTICS
AND WHAT IS A CORPUS
Handford, M. and McCarthy, M. J. (2004) “Invisible to us” - A preliminary corpus-based study of spoken
business English, Discourse in the Professions: Perspectives from Corpus Linguistics, 167-201
![Page 5: Sinmin Literature Review Presentation](https://reader034.fdocuments.net/reader034/viewer/2022042716/55a9d63c1a28abf0788b45aa/html5/thumbnails/5.jpg)
WHAT IS A CORPUS?
“A corpus is a principled collection of authentic texts
stored electronically that can be used to discover
information about language that may not have been
noticed through intuition alone.” - Bennett (2010)
Bennett, G. R. (2010) Using Corpora in the Language Learning Classroom, Michigan ELT.
![Page 6: Sinmin Literature Review Presentation](https://reader034.fdocuments.net/reader034/viewer/2022042716/55a9d63c1a28abf0788b45aa/html5/thumbnails/6.jpg)
● There are eight main kinds of corpora.
● They are generalized corpora, specialized corpora,
learner corpora, pedagogic corpora, historical
corpora, parallel corpora, comparable corpora,
and monitor corpora.
● The broadest type of corpus is the generalized
corpus.
![Page 7: Sinmin Literature Review Presentation](https://reader034.fdocuments.net/reader034/viewer/2022042716/55a9d63c1a28abf0788b45aa/html5/thumbnails/7.jpg)
“Sinmin” will
● be a generalized corpus.
● cover all types of Sinhala language.
![Page 8: Sinmin Literature Review Presentation](https://reader034.fdocuments.net/reader034/viewer/2022042716/55a9d63c1a28abf0788b45aa/html5/thumbnails/8.jpg)
USAGES OF A CORPUS
● Implementing translators, spell checkers and grammar
checkers.
● Identifying lexical and grammatical features of a language.
● Identifying how language varies with context of usage and
over time.
● Retrieving statistical details of a language.
● Providing backend support for tools like OCR, POS Tagger,
etc.
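As an illustration of the "statistical details" usage, here is a minimal sketch of counting word frequencies. It assumes that whitespace tokenization is good enough for a first pass, and the two Sinhala sentences are made-up sample input, not Sinmin data.

```python
from collections import Counter


def word_frequencies(texts):
    """Count how often each whitespace-separated token occurs across texts."""
    counts = Counter()
    for text in texts:
        counts.update(text.split())
    return counts


# Hypothetical sample; a real corpus would supply the crawled documents.
sample = ["මම පාසලට යමි", "මම ගෙදර යමි"]
freq = word_frequencies(sample)

# Print the tokens from most to least frequent.
for word, count in freq.most_common():
    print(word, count)
```

The same counts can back a spell checker's word list or the frequency views of a corpus interface.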
![Page 9: Sinmin Literature Review Presentation](https://reader034.fdocuments.net/reader034/viewer/2022042716/55a9d63c1a28abf0788b45aa/html5/thumbnails/9.jpg)
EXISTING CORPUS IMPLEMENTATIONS
![Page 10: Sinmin Literature Review Presentation](https://reader034.fdocuments.net/reader034/viewer/2022042716/55a9d63c1a28abf0788b45aa/html5/thumbnails/10.jpg)
CORPUS FOR SINHALA LANGUAGE?
● There is an existing corpus for the Sinhala language,
known as the UCSC Text Corpus of
Contemporary Sinhala.
● It consists of about 10 million words, but it covers
only a small range of the language and is not updated.
![Page 11: Sinmin Literature Review Presentation](https://reader034.fdocuments.net/reader034/viewer/2022042716/55a9d63c1a28abf0788b45aa/html5/thumbnails/11.jpg)
COMPOSITION OF THE CORPUS
● The language in the corpus cannot be random; it must
be chosen according to specific characteristics.
● It must use authentic texts. The language it contains
is not made up for the sole purpose of creating the
corpus.
![Page 12: Sinmin Literature Review Presentation](https://reader034.fdocuments.net/reader034/viewer/2022042716/55a9d63c1a28abf0788b45aa/html5/thumbnails/12.jpg)
EXAMPLE - COMPOSITION OF COCA
● COCA contains more than 385 million words
from 1990–2008 (about 20 million words each year).
● Texts are evenly divided between 5 genres: spoken
(20%), fiction (20%), popular magazines (20%),
newspapers (20%) and academic journals (20%).
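The balance described above can be checked with a little arithmetic. This sketch uses only the rounded figures quoted on the slide (20 million words per year, five equal genres, 1990 to 2008 inclusive), so the total is approximate and lands slightly below the quoted "more than 385 million".

```python
# Rounded figures taken from the slide; real yearly sizes vary slightly.
WORDS_PER_YEAR = 20_000_000
GENRES = ["spoken", "fiction", "popular magazines", "newspapers", "academic journals"]
YEARS = range(1990, 2009)  # 1990 through 2008 inclusive

# Equal 20% shares mean each genre gets the same yearly quota.
per_genre_per_year = WORDS_PER_YEAR // len(GENRES)
approx_total = WORDS_PER_YEAR * len(YEARS)

print(f"{len(YEARS)} years, {per_genre_per_year:,} words per genre per year")
print(f"approximate total: {approx_total:,} words")
```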
![Page 13: Sinmin Literature Review Presentation](https://reader034.fdocuments.net/reader034/viewer/2022042716/55a9d63c1a28abf0788b45aa/html5/thumbnails/13.jpg)
COMPOSITION OF UCSC TEXT CORPUS OF
CONTEMPORARY SINHALA
![Page 14: Sinmin Literature Review Presentation](https://reader034.fdocuments.net/reader034/viewer/2022042716/55a9d63c1a28abf0788b45aa/html5/thumbnails/14.jpg)
DATA STORAGE AND INFORMATION
RETRIEVAL FROM CORPUS
Existing corpora use two main technologies for data
storage
● Relational databases
● Indexed file systems
![Page 15: Sinmin Literature Review Presentation](https://reader034.fdocuments.net/reader034/viewer/2022042716/55a9d63c1a28abf0788b45aa/html5/thumbnails/15.jpg)
INDEXED FILE SYSTEMS AS STORAGE
● BNC uses this mechanism.
● Data is stored as XML-like files which follow a
scheme known as the Corpus Document Interchange
Format (CDIF).
● This allows a great deal of detail about the
structure of each text to be stored, such as its division into
sections or chapters, paragraphs, verse lines, etc.
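To show how structural markup of this kind can be queried, here is a minimal sketch using Python's standard XML parser. The element and attribute names (`text`, `div`, `p`, `s`, `n`) are illustrative assumptions and are not taken from the actual BNC scheme.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: chapters (div) contain paragraphs (p) made of sentences (s).
SAMPLE = """<text id="demo">
  <div type="chapter" n="1">
    <p><s>First sentence.</s><s>Second sentence.</s></p>
    <p><s>Third sentence.</s></p>
  </div>
  <div type="chapter" n="2">
    <p><s>Fourth sentence.</s></p>
  </div>
</text>"""

root = ET.fromstring(SAMPLE)

# Map each chapter number to (paragraph count, sentence count).
structure = {}
for div in root.findall("div"):
    paragraphs = div.findall("p")
    sentences = div.findall("p/s")
    structure[div.get("n")] = (len(paragraphs), len(sentences))

print(structure)
```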
![Page 16: Sinmin Literature Review Presentation](https://reader034.fdocuments.net/reader034/viewer/2022042716/55a9d63c1a28abf0788b45aa/html5/thumbnails/16.jpg)
![Page 17: Sinmin Literature Review Presentation](https://reader034.fdocuments.net/reader034/viewer/2022042716/55a9d63c1a28abf0788b45aa/html5/thumbnails/17.jpg)
RELATIONAL DATABASE AS STORAGE
● COCA and Corpus del Español use relational databases.
![Page 18: Sinmin Literature Review Presentation](https://reader034.fdocuments.net/reader034/viewer/2022042716/55a9d63c1a28abf0788b45aa/html5/thumbnails/18.jpg)
DATA MODEL IN COCA
![Page 19: Sinmin Literature Review Presentation](https://reader034.fdocuments.net/reader034/viewer/2022042716/55a9d63c1a28abf0788b45aa/html5/thumbnails/19.jpg)
CORPUS DEL ESPAÑOL USES SEPARATE
TABLES FOR BIGRAMS AND TRIGRAMS
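The idea of keeping bigrams and trigrams in their own tables can be sketched with an in-memory SQLite database. The table and column names here are hypothetical, since the real schema is only shown as an image, and SQLite stands in for whatever database the corpus actually uses.

```python
import sqlite3
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


conn = sqlite3.connect(":memory:")
# One table per n-gram size, keyed by the words themselves (hypothetical schema).
conn.execute(
    "CREATE TABLE bigrams (w1 TEXT, w2 TEXT, freq INTEGER, PRIMARY KEY (w1, w2))"
)
conn.execute(
    "CREATE TABLE trigrams (w1 TEXT, w2 TEXT, w3 TEXT, freq INTEGER, "
    "PRIMARY KEY (w1, w2, w3))"
)

tokens = "the cat sat on the cat".split()

# Aggregate counts in Python first, then store one row per distinct n-gram.
for gram, freq in Counter(ngrams(tokens, 2)).items():
    conn.execute("INSERT INTO bigrams VALUES (?, ?, ?)", (*gram, freq))
for gram, freq in Counter(ngrams(tokens, 3)).items():
    conn.execute("INSERT INTO trigrams VALUES (?, ?, ?, ?)", (*gram, freq))

the_cat = conn.execute(
    "SELECT freq FROM bigrams WHERE w1 = ? AND w2 = ?", ("the", "cat")
).fetchone()[0]
trigram_rows = conn.execute("SELECT COUNT(*) FROM trigrams").fetchone()[0]
print(the_cat, trigram_rows)
```

Precomputing the tables trades storage for lookup speed: a phrase query becomes a key lookup instead of a scan over running text.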
![Page 20: Sinmin Literature Review Presentation](https://reader034.fdocuments.net/reader034/viewer/2022042716/55a9d63c1a28abf0788b45aa/html5/thumbnails/20.jpg)
RELATIONAL DB VS INDEXED FILE
SYSTEMS
● Indexed file systems make extensive use of indexes.
● Relational database models are relatively fast.
● In indexed file systems, it is difficult to add additional
layers of annotation.
![Page 21: Sinmin Literature Review Presentation](https://reader034.fdocuments.net/reader034/viewer/2022042716/55a9d63c1a28abf0788b45aa/html5/thumbnails/21.jpg)
No study has been done on
how NoSQL databases perform when
used to implement corpora.
![Page 22: Sinmin Literature Review Presentation](https://reader034.fdocuments.net/reader034/viewer/2022042716/55a9d63c1a28abf0788b45aa/html5/thumbnails/22.jpg)
INFORMATION VISUALIZATION
Most of the popular corpora, such as BNC, COCA, Corpus
del Español and the Google Books corpus, use a similar kind of
web interface.
![Page 23: Sinmin Literature Review Presentation](https://reader034.fdocuments.net/reader034/viewer/2022042716/55a9d63c1a28abf0788b45aa/html5/thumbnails/23.jpg)
USER INTERFACE
OF COCA
![Page 24: Sinmin Literature Review Presentation](https://reader034.fdocuments.net/reader034/viewer/2022042716/55a9d63c1a28abf0788b45aa/html5/thumbnails/24.jpg)
![Page 25: Sinmin Literature Review Presentation](https://reader034.fdocuments.net/reader034/viewer/2022042716/55a9d63c1a28abf0788b45aa/html5/thumbnails/25.jpg)
GOOGLE BOOKS NGRAM VIEWER UI
![Page 26: Sinmin Literature Review Presentation](https://reader034.fdocuments.net/reader034/viewer/2022042716/55a9d63c1a28abf0788b45aa/html5/thumbnails/26.jpg)
EXTRACTING LINGUISTIC FEATURES
● A main usage of a language corpus is extracting
linguistic features of a language.
● Linguistic features for many languages have been
identified using corpora.
● Example - A corpus-based linguistics analysis on
written corpus: colligation of “TO” and “FOR.”
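A colligation study of this kind looks at which grammatical categories follow a word. The sketch below counts the part-of-speech tag of the token after "to" and "for". The tagged sentence and its tag names are hand-made assumptions, since a real study would take them from a POS-tagged corpus.

```python
from collections import Counter, defaultdict

# Hypothetical hand-tagged tokens as (word, part-of-speech tag) pairs.
tagged = [
    ("I", "PRON"), ("want", "VERB"), ("to", "PART"), ("go", "VERB"),
    ("to", "ADP"), ("school", "NOUN"), ("for", "ADP"), ("lessons", "NOUN"),
    ("and", "CONJ"), ("for", "ADP"), ("fun", "NOUN"),
]


def colligates(tagged_tokens, targets):
    """For each target word, count the tags of the tokens that follow it."""
    result = defaultdict(Counter)
    for (word, _), (_, next_tag) in zip(tagged_tokens, tagged_tokens[1:]):
        if word.lower() in targets:
            result[word.lower()][next_tag] += 1
    return result


patterns = colligates(tagged, {"to", "for"})
for word, tags in patterns.items():
    print(word, dict(tags))
```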
![Page 27: Sinmin Literature Review Presentation](https://reader034.fdocuments.net/reader034/viewer/2022042716/55a9d63c1a28abf0788b45aa/html5/thumbnails/27.jpg)
![Page 28: Sinmin Literature Review Presentation](https://reader034.fdocuments.net/reader034/viewer/2022042716/55a9d63c1a28abf0788b45aa/html5/thumbnails/28.jpg)
![Page 29: Sinmin Literature Review Presentation](https://reader034.fdocuments.net/reader034/viewer/2022042716/55a9d63c1a28abf0788b45aa/html5/thumbnails/29.jpg)
CURRENT PROGRESS
![Page 30: Sinmin Literature Review Presentation](https://reader034.fdocuments.net/reader034/viewer/2022042716/55a9d63c1a28abf0788b45aa/html5/thumbnails/30.jpg)
IDENTIFIED SINHALA RESOURCES
● Online Newspapers
● News Websites
● School Textbooks
● Sinhala Wikipedia
● Online Mahawansaya
● Subtitles
● Sinhala Fiction
● Sinhala Blogs
● Sinhala Magazines
● Gazette
![Page 31: Sinmin Literature Review Presentation](https://reader034.fdocuments.net/reader034/viewer/2022042716/55a9d63c1a28abf0788b45aa/html5/thumbnails/31.jpg)
DIVIDED INTO 5 MAIN GENRES
● News: Newspapers, News Items
● Academic: Textbooks, Religious, Wikipedia, Mahawansa
● Creative Writing: Fiction, Blogs, Magazines
● Spoken: Subtitles
● Gazette: Gazette
![Page 32: Sinmin Literature Review Presentation](https://reader034.fdocuments.net/reader034/viewer/2022042716/55a9d63c1a28abf0788b45aa/html5/thumbnails/32.jpg)
Implemented crawlers for the different sources,
all adhering to the same output format.
https://github.com/madurangasiriwardena/corpus.sinhala.crawler
![Page 33: Sinmin Literature Review Presentation](https://reader034.fdocuments.net/reader034/viewer/2022042716/55a9d63c1a28abf0788b45aa/html5/thumbnails/33.jpg)
FINISHED CRAWLERS
![Page 34: Sinmin Literature Review Presentation](https://reader034.fdocuments.net/reader034/viewer/2022042716/55a9d63c1a28abf0788b45aa/html5/thumbnails/34.jpg)
CRAWLED DATA IS SAVED TO XML FILES WITH
THE FOLLOWING METADATA
● Post Name
● Author
● Link
● Published Date
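A minimal sketch of writing one crawled post with these four metadata fields is shown below. The element names (`post`, `name`, `author`, `link`, `date`, `content`) and the sample values are assumptions for illustration. The crawlers' real XML layout is in the linked repository.

```python
import xml.etree.ElementTree as ET


def post_to_xml(name, author, link, date, content):
    """Serialize one crawled post and its metadata to an XML string."""
    post = ET.Element("post")
    ET.SubElement(post, "name").text = name
    ET.SubElement(post, "author").text = author
    ET.SubElement(post, "link").text = link
    ET.SubElement(post, "date").text = date
    ET.SubElement(post, "content").text = content
    return ET.tostring(post, encoding="unicode")


# Hypothetical sample values, not real crawled data.
xml_text = post_to_xml(
    name="Sample post",
    author="Unknown",
    link="http://example.com/post/1",
    date="2014-01-31",
    content="Body text of the post.",
)
print(xml_text)
```

Keeping every crawler on one output shape means later stages can load any source with the same parser.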
![Page 35: Sinmin Literature Review Presentation](https://reader034.fdocuments.net/reader034/viewer/2022042716/55a9d63c1a28abf0788b45aa/html5/thumbnails/35.jpg)
CRAWLER CONTROLLER
Crawler controller monitors and handles the status of
the web crawlers.
Crawler controller address -
http://Sinhala-corpus.projects.uom.lk:8080/CrawlerControllerWeb
![Page 36: Sinmin Literature Review Presentation](https://reader034.fdocuments.net/reader034/viewer/2022042716/55a9d63c1a28abf0788b45aa/html5/thumbnails/36.jpg)
![Page 37: Sinmin Literature Review Presentation](https://reader034.fdocuments.net/reader034/viewer/2022042716/55a9d63c1a28abf0788b45aa/html5/thumbnails/37.jpg)
We tested the performance of several database
systems to determine which one we should use
to store the data.
![Page 38: Sinmin Literature Review Presentation](https://reader034.fdocuments.net/reader034/viewer/2022042716/55a9d63c1a28abf0788b45aa/html5/thumbnails/38.jpg)
WE CONSIDERED THE FOLLOWING DATA
STORAGE SYSTEMS
![Page 39: Sinmin Literature Review Presentation](https://reader034.fdocuments.net/reader034/viewer/2022042716/55a9d63c1a28abf0788b45aa/html5/thumbnails/39.jpg)
We measured performance for inserting
data and for answering 12 different
information needs.
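The shape of such a test can be sketched as a small timing harness: time a bulk insert, then time one retrieval query. In-memory SQLite, the `words` table and the synthetic word list are stand-ins chosen so the example runs anywhere. They are not the systems, schema or data set used in the actual comparison.

```python
import sqlite3
import time


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def insert_words(conn, words):
    """Bulk-insert one row per word occurrence."""
    conn.executemany("INSERT INTO words (word) VALUES (?)", [(w,) for w in words])
    conn.commit()


def count_word(conn, word):
    """One example information need: the frequency of a given word."""
    row = conn.execute("SELECT COUNT(*) FROM words WHERE word = ?", (word,)).fetchone()
    return row[0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (word TEXT)")

# Synthetic data: 10,000 occurrences spread over 100 distinct words.
words = ["w%d" % (i % 100) for i in range(10_000)]

_, insert_seconds = timed(insert_words, conn, words)
hits, query_seconds = timed(count_word, conn, "w7")
print(f"insert: {insert_seconds:.4f}s, query: {query_seconds:.4f}s, hits: {hits}")
```

Repeating the insert step with growing data sizes is one way to see how insertion time scales for each candidate system.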
Data set and source code -
https://github.com/madurangasiriwardena/performance-test
![Page 40: Sinmin Literature Review Presentation](https://reader034.fdocuments.net/reader034/viewer/2022042716/55a9d63c1a28abf0788b45aa/html5/thumbnails/40.jpg)
DATA INSERTION TIME COMPARISON
![Page 41: Sinmin Literature Review Presentation](https://reader034.fdocuments.net/reader034/viewer/2022042716/55a9d63c1a28abf0788b45aa/html5/thumbnails/41.jpg)
INFORMATION RETRIEVAL PERFORMANCE
COMPARISON - PART 1
![Page 42: Sinmin Literature Review Presentation](https://reader034.fdocuments.net/reader034/viewer/2022042716/55a9d63c1a28abf0788b45aa/html5/thumbnails/42.jpg)
INFORMATION RETRIEVAL PERFORMANCE
COMPARISON - PART 2
![Page 43: Sinmin Literature Review Presentation](https://reader034.fdocuments.net/reader034/viewer/2022042716/55a9d63c1a28abf0788b45aa/html5/thumbnails/43.jpg)
Cassandra performed better than the others in
most of the scenarios, and its insertion
time increased linearly.
So we chose it to implement the
corpus.
![Page 44: Sinmin Literature Review Presentation](https://reader034.fdocuments.net/reader034/viewer/2022042716/55a9d63c1a28abf0788b45aa/html5/thumbnails/44.jpg)
USER INTERFACE DESIGN AND
IMPLEMENTATION
● The web interface of Sinmin has been designed for users
who prefer a visualised and summarized view
of Sinmin's statistical data.
● The visual design of the interface is intended to let
any user, even one without prior experience of the
interface, fulfill their information
requirements with little effort.
http://sinhala-corpus.projects.uom.lk/sinmin-web/
![Page 45: Sinmin Literature Review Presentation](https://reader034.fdocuments.net/reader034/viewer/2022042716/55a9d63c1a28abf0788b45aa/html5/thumbnails/45.jpg)
![Page 46: Sinmin Literature Review Presentation](https://reader034.fdocuments.net/reader034/viewer/2022042716/55a9d63c1a28abf0788b45aa/html5/thumbnails/46.jpg)
![Page 47: Sinmin Literature Review Presentation](https://reader034.fdocuments.net/reader034/viewer/2022042716/55a9d63c1a28abf0788b45aa/html5/thumbnails/47.jpg)
![Page 48: Sinmin Literature Review Presentation](https://reader034.fdocuments.net/reader034/viewer/2022042716/55a9d63c1a28abf0788b45aa/html5/thumbnails/48.jpg)
CORPUS API DESIGN AND IMPLEMENTATION
• REST API to expose Corpus services
• More complex and customizable data retrieval and
filtering
• Interface for third party applications to consume
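How a third-party application might consume such an API can be sketched with a plain request handler. The `/wordFrequency` path, the `word` parameter, the JSON shape and the in-memory `FREQUENCIES` table are all hypothetical, as the slides do not describe the real endpoints.

```python
import json
from urllib.parse import parse_qs, urlparse

# Hypothetical stand-in for the corpus store.
FREQUENCIES = {"alpha": 3, "beta": 1}


def handle_get(url):
    """Return (HTTP status, JSON body) for a GET request to the given URL."""
    parsed = urlparse(url)
    if parsed.path != "/wordFrequency":
        return 404, json.dumps({"error": "unknown endpoint"})
    params = parse_qs(parsed.query)
    word = params.get("word", [None])[0]
    if word is None:
        return 400, json.dumps({"error": "missing 'word' parameter"})
    return 200, json.dumps({"word": word, "frequency": FREQUENCIES.get(word, 0)})


status, body = handle_get("/wordFrequency?word=alpha")
print(status, body)
```

Keeping the handler a pure function of the URL makes it easy to test before wiring it into a web framework.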
![Page 49: Sinmin Literature Review Presentation](https://reader034.fdocuments.net/reader034/viewer/2022042716/55a9d63c1a28abf0788b45aa/html5/thumbnails/49.jpg)
PUBLICATIONS
● Comparison between performance of various
database systems for implementing a language
corpus - 11th Beyond Databases, Architectures and
Structures conference (Pending)
● Implementing a Corpus for Sinhala Language -
Symposium on Language Technology for South Asia
(Pending)
![Page 50: Sinmin Literature Review Presentation](https://reader034.fdocuments.net/reader034/viewer/2022042716/55a9d63c1a28abf0788b45aa/html5/thumbnails/50.jpg)
REMAINING WORK FOR THE NEXT PHASE
• Finish writing the crawlers
• Feed data into the Cassandra database
• Connect the front end to the API
![Page 51: Sinmin Literature Review Presentation](https://reader034.fdocuments.net/reader034/viewer/2022042716/55a9d63c1a28abf0788b45aa/html5/thumbnails/51.jpg)
Questions?
![Page 52: Sinmin Literature Review Presentation](https://reader034.fdocuments.net/reader034/viewer/2022042716/55a9d63c1a28abf0788b45aa/html5/thumbnails/52.jpg)
Thank you!