CLEAR June 2014 Volume-3 Issue-2
CLEAR Magazine (Computational Linguistics in Engineering And Research)
M.Tech Computational Linguistics, Dept. of Computer Science and Engineering
Govt. Engineering College, Sreekrishnapuram, Palakkad 678633
www.simplegroups.in
[email protected]

Chief Editor: Dr. P. C. Reghu Raj, Professor and Head, Dept. of Computer Science and Engineering, Govt. Engineering College, Sreekrishnapuram, Palakkad

Editors: Sreejith C, Reshma O K, Gopalakrishnan G, Neethu Johnson

Cover page and Layout: Sreejith C
Contents

Editorial 4
SIMPLE News & Updates 5
Language Computing – A New Computing Arena, Elizabeth Sherly 7
Data Completion using Cogent Confabulation, Sushma D K 12
Malayalam POS Tagging Using Conditional Random Fields, Archana T. C, Haritha Lakshmi, Krishnapriya, Sreesha, Jayasree N V 17
Topic Modelling and the LDA Algorithm, Indu M 22
Tamil to Malayalam Transliteration, Kavitha Raju, Sreerekha T V, Vidya P V 26
Memory-Based Language Processing and Machine Translation, Nisha M 29
Dialect Resolution, Manu V. Nair, Sarath K. S 34
M.Tech Computational Linguistics Project Abstracts (2012-2014) 38
CLEAR Call for Articles 46
Last Word 47
Greetings!
This edition of CLEAR is marked by its variety in content and by contributions from eminent scholars such as Prof. Elizabeth Sherly of IIITM-K. It is heartening to note that our efforts are attracting wider attention from the academic community. We seek the readers' suggestions and comments on the previous edition, on Indian Language Computing. As our effort is to bring language technology into mainstream academics, we plan to include articles on NLP/CL tools and platforms for specialized tasks, and on the latest computing paradigms like MapReduce and MBLP and their relevance to language technology. So, keep reading!
With Best Wishes,
Dr. P. C. Reghu Raj
(Chief Editor)
NEWS & UPDATES

Publications
"Text Based Language Identification System for Indian Languages Following Devanagiri Script", Indhuja K, Indu M, Sreejith C, Dr. Reghu Raj P C, International Journal of Engineering Research & Technology (IJERT), Volume 3, Issue 04, April 2014, ISSN: 2278-0181

"Eigenvector Based Approach for Sentence Ranking in News Summarization", Divya S, Reghuraj P C, IJCLNLP, April 2014

"A Natural Language Question Answering System in Malayalam Using Domain Dependent Document Collection as Repository", Pragisha K, P C Reghu Raj, IJCLNLP, April 2014

"Box Item Generation from News Articles Based on Paragraph Ranking using Vector Space Model", Sreejith C, Sruthimol M P, P C Reghuraj, International Journal of Scientific Research in Computer Science Applications and Management Studies (IJSRCSAMS), Volume 3, Issue 2, March 2014, ISSN 2319-1953
SIMPLE Groups congratulates all the authors on their achievements!
Industrial Training at IIITM-K
The Virtual Resource Centre for Language Computing (VRC-LC) department of the Indian Institute of Information Technology and Management - Kerala (IIITM-K) organized a short course and industrial training on Natural Language Processing exclusively for the M.Tech students of GEC, Sreekrishnapuram.

The course focused on recent trends in Computational Linguistics and Machine Learning related to Malayalam Computing. It was a 15-day programme (5th May to 20th May). During the course, research scholars and eminent faculty of VRC-LC delivered sessions on various aspects of Language Processing. Discussion of the Malayalam computing work in progress at the centre gave the participants a clear idea of the workings of, and the challenges involved in, Malayalam Computing.
Language Computing: A New Computing
Arena
Elizabeth Sherly,
Professor,
IIITM-Kerala
The great strides made in Information Technology in every walk of life have slowly given prominence to Language Computing, which has immense importance in today's computing world. A country like India, which values its culture, heritage and languages, is greatly in need of local language support so as to counter the dominance of English in computing. India has about 25 years of history in language computing (LC), which has gone through its ups and downs. But during the last decade there has been a paradigm shift, with a significant leap for LC as a new computing arena. Scientists who were once reluctant to take up language computing for research came to choose language technology (LT) as a mainstream research area, and, interestingly, industry giants like Microsoft and Google entered Language Computing in a big way. A phenomenal shift in LT has thus taken place, as Language Technology tools have become indispensable for enhancing products and services in high-growth markets such as mobile applications, healthcare, IT services, financial services, online retail, call centers, publishing and media.
Thanks are due to Noam Chomsky, a mathematician turned linguist, who expressed the structure of human language in mathematically viable symbols by introducing the concept of the TREE diagram. Research in computational linguistics then gained good momentum, because there was now a way to represent linguistics in mathematical and logical form, enabling a piece of text to be converted into a programmer-friendly data structure. As natural language involves human understanding and processing, Artificial Intelligence also plays a significant role in the development of Computational Linguistics models. The terms Language Computing (LC) and Computational Linguistics (CL) are sometimes used interchangeably; both are key terminologies derived from Natural Language Processing (NLP).
Some of the major applications in LC are Machine Translation, Information Retrieval, Automatic Summarization, Question Answering, Automatic Speech Recognition, Language Writing and Spoken Aids, Dialog Systems, Man-Machine Interfaces, Knowledge Representation etc. Machine Translation (MT) is one of the major tasks; research on it goes back to the 1950s, but it still remains an open problem. There is no 100% accurate Machine Translation system for any pair of languages in the world; the day we achieve that, most of the other problems in LT will be resolved. The major task involved in Machine Translation systems is to describe the language in terms of syntactic and semantic information. The syntactic information is generated using Morphological Analysis, Parts-of-Speech (POS) tagging, Chunking, and Parsing. The semantic information is obtained using Word-Sense Disambiguation (WSD), Semantic Role Labelling (SRL), Named Entity Recognition (NER), and Anaphora Resolution (AR). The primary research in CL is basically to develop models and tools for the above-mentioned components. The challenge in this work is that each language has a diverse linguistic nature, with varied morphological features, inflectional sets, grammatical structure etc., which makes language computing research more complex. The limited availability of large corpora and dictionaries in each language is another constraint for researchers in LC.
In India, there has been a phenomenal shift in Language Technology and Computational Linguistics research and development over the last several years, as Language Technology tools have become inevitable in many applications. The Department of Electronics and Information Technology (DEITY), India, initiated Technology Development for Indian Languages (TDIL) (http://tdil.mit.gov.in) with the objective of developing information processing tools and techniques to facilitate human-machine interaction without a language barrier, creating and accessing multilingual knowledge resources, and integrating them to develop innovative user products and services. The major activities of TDIL involve research and development in Language Technology, including machine translation, multilingual parallel corpora, cross-lingual information access, optical character recognition and text-to-speech conversion, and the development of standards related to Language Technology. Various projects and research efforts have been ongoing under TDIL through the work of a number of scientific organizations and educational institutions. The IITs, IIIT-Hyderabad, IIITM-Kerala, CIIL-Mysore, Hyderabad Central University, the Centre for Computer Science and the Sanskrit Centre at JNU, New Delhi, and CDAC are the main centres where computational linguistics is taught and researched.
Apart from Machine Translation, the research directions in Language Computing are towards Automatic Speech Recognition, Speech-to-Text Processing, Web and Semantics, Sentiment Analysis, Named Entity Recognition (NER), Anaphora Resolution, Word-Sense Disambiguation etc. It is not too futuristic to believe that we could be talking to computers that act according to our commands. New techniques and models have to be explored for better results. Ontology and Semantics and psycholinguistic analysis are other upcoming areas whose research aims at getting more sensible information from the web and other applications. The mobile phone, being one of the handiest and most widely used gadgets, needs local language support for its various features; this is yet another potential area for research. Since the Internet of Things (IoT) is closely associated with Machine-to-Machine, Machine-to-Human and Machine-to-small-device communication, it needs local language enablement to provide information to the local masses.
The opportunities in industry are also very promising for CL. One look at the 2013 Gartner predictions tells us that there is a huge demand for computational language scientists. Natural Language Processing and Speech Recognition are no doubt prominent among the most anticipated technologies of the near future. Industries that employ computational linguists are forging ahead tremendously, because currently most web, mobile and social media based work needs extensive language support. The industry looks mainly for three roles in Language Computing: linguists, computer programmers (both having knowledge of language computing) and researchers. Researchers in both Linguistics and Computer Science also have good opportunities in the industry; Microsoft, Google, IBM, HP and several other companies working heavily in this field need trained manpower in LT. The global market value of 19.3 billion in 2011 has been predicted to shoot to 30 billion in 2015. The major thrust areas are Machine Translation, to create systems and technologies catering to today's multitude of translation scenarios; multilingual systems, to develop a natural-language-neutral approach in all aspects of linguistic computing; and natural language processing and Automatic Speech Recognition, to design and build software which can analyze, understand, and generate natural human languages, so as to make addressing a computer like addressing a human being.
Work on Malayalam Computing has been active for the last decade, with the aim of enabling computers to understand and process Malayalam. The major work in Malayalam Computing involves Machine Translation systems from Malayalam to other languages and vice versa, spell checkers, Malayalam search engines, Malayalam text-to-speech and speech-to-text systems, morphological analysers and POS taggers for Malayalam, Malayalam text editors, human interaction interfaces, Malayalam language tutors, corpora building, dictionary building etc. CDAC, CDIT, IIITM-K, AU-KBC (Chennai), CIIL-Mysore, IIIT-Hyderabad and Amrita are some of the major institutions actively pursuing research and product development in Malayalam Computing. There are also certain NGOs and communities actively engaged in work related to Malayalam Computing.
Despite all the efforts made by many contributors, awareness among the masses is very low. A mechanism for encouraging and promoting Indian Language Computing is the need of the hour. There should be flagship programmes and movements in the state, in academic institutions and in other organizations to give more visibility and accessibility, ensuring reach into every corner of the nation. Language Computing and its use, support, and standardization have to be well placed in India's IT policy, at both the national and state levels. Each language group should actively promote its language and culture, not only among the people of its own state, but wherever that language group exists worldwide. Taking Malayalam as an example, some of the promotional activities that can be done at the state and college levels are listed below.

• Include the importance of Malayalam Computing in Kerala's IT Policy.
• Create an Internet mailing list, say Malayalam.Net, and introduce the various tools and products developed through it.

• Form a Malayalam Computing Task Force; promote and support various activities and form PMUs for implementation.

• Establish a non-profit, Government/non-Government body to promote Malayalam Computing.

• Promote periodical workshops, seminars and conferences on Malayalam Computing in educational institutions.

• Make Language Computing mandatory for Computer Science courses at the university level.

• Promote Malayalam Computing using social media.

• Publish articles, research papers, tools and products, and news on LC in magazines, journals and notice boards.
Due to the inherent structure of Malayalam as a highly agglutinative and inflectional language, the issues in Malayalam Computing are many compared to other languages. The rendering issues in the display of Malayalam fonts require greater attention for a better appearance. The computational models for various tasks, namely morphological analysis, POS tagging etc., are to be refined for more accurate results. Most of the models are rule-based, statistical or hybrid models like Support Vector Machines, HMM and TnT, which show 95-100% accuracy. The Machine Translation systems, however, have shown an accuracy of only 60-65%. Deep Learning has recently shown much promise for NLP applications. The deep neural network, with its capability to learn distributed representations based on the similarity of words and its ability to learn multiple levels of representation, is encouraging for cognitive problem solving such as natural language understanding and Speech Recognition. The day when we use computers with Malayalam input devices and interfaces, with Malayalam processing and understanding handled the way humans handle it, is not very far.
Data Completion using Cogent
Confabulation
Sushma D K
M.Tech
Dept. of Computer Science
SJBIT,
Bangalore
Introduction
Information is being generated on a daily basis and continuously uploaded onto the web in massive quantities. This information is of many types, ranging from simple text files to video files, and it may sometimes be incomplete due to various reasons: variations in electrical signals, intentional causes, unknowingly overwriting the data, etc. Cogent confabulation is a unique and novel technique for completing missing data. The cogent confabulation problem consists of two major tasks: (1) training on the available data to generate knowledge bases (matrices storing the appearance counts of words in the training corpus); (2) querying over this trained data to predict the next word or phrase, given the starting few words of the phrase or sentence.
To overcome the difficulty of completing missing information, the previously available data has first to be studied extensively. This data can then be trained at the semantic level using a particular model and used in the completion of the missing data. Cogent confabulation is a new model of vertebrate cognition used for training and querying the available data. Confabulation is the process of selecting the one symbol (termed the conclusion of the confabulation) whose representing words happen to be receiving the highest level of excitation (appearance count). The confabulation product is the product of the probabilities of occurrence of the assumed words individually with the target word. Cogency is defined as the probability of occurrence of all the assumed words together with the target word.
To predict and fill the missing data, the procedure starts by maximizing the cogency p(αβγδ|ε) and the confabulation product p(α|ε)·p(β|ε)·p(γ|ε)·p(δ|ε). The order in which the corpus words are chosen for training is meant to reflect the pattern of text contained in the expository text. The approach uses lexical analysis based on the appearance counts of words, an information retrieval measurement, to determine how likely a particular word is to be the conclusion word. Phrase completion and sentence continuation using the confabulation model should be useful for many text analysis tasks, including information retrieval and summarization, generating artificial stories etc.
Confabulation
Confabulation is a new model of vertebrate cognition which mimics Hebbian learning and the information processing procedures of the human brain. Cognitive information processing is a direct evolutionary re-application of the neural circuits controlling movement, and thus functions just like movement. Conceptually, brains are composed of many muscles of thought (termed thalamocortical modules in mammals). A module contains symbols, each of which is a sparse collection of neurons that functions as a descriptor of the attribute of that module. For example, if the attribute of a module is color, then a single symbol represents a particular color. Each thalamocortical module is connected to many other modules. When two symbols are active simultaneously, they are said to co-occur, which creates the opportunity to associate the two symbols. For instance, after seeing a face and hearing a name together, the symbols representing each may become associated. Each strengthened unidirectional association between two symbols is termed a knowledge link; collectively, knowledge links comprise all cognitive knowledge. Each thalamocortical module performs the same single information processing operation, which can be thought of as a contraction of a list of symbols, termed a confabulation. Throughout a confabulation, input excitation is delivered to the module through knowledge links from active symbols in other modules, driving the activation of these knowledge links' target symbols (the candidate conclusion symbols) in the module performing the confabulation. As the list of conclusion symbols contracts there is no physical movement in the brain; rather, the symbols currently on the list compete (based upon their relative excitation levels) for eventual exclusive activation (a so-called winner-take-all competition) within that module, and, as a result, the number of active symbols is gradually reduced.
Crucially, this contraction of the
candidate conclusion symbol list in each
thalamocortical module is externally
controlled by a thought control signal
delivered to the module. A confabulation in a
thalamocortical module is controlled by a
graded analog control input. The thought
control signal determines how many symbols
remain in the competition, but has no effect
on selecting which symbols are in the
competition. Which symbols are in the
competition is determined by the excitation
level of a symbol as it dynamically reacts to
knowledge link input from active symbols in
other modules (which cause its excitation
level to increase) or to a reduction or
cessation of such input (which causes its
excitation level to fall). Ultimately, the
thought control signal is used to dynamically
contract the number of active symbols in a
module from an initial many less-active
symbols to, at the end of the confabulation, a
single maximally-active symbol. The
resulting single active symbol is termed the
confabulation conclusion. The learned
association between each symbol of a
module and its set of action commands is
termed skill knowledge. Skill knowledge is
stored in the module, but the learning of
these associations is controlled by
subcortical brain nuclei. When a conclusion
is reached in a module, those action
commands which have a learned association
from that conclusion symbol are instantly
launched. These issued action commands are
proposed as the source of all non-reflexive
and non-autonomic behaviors.
Thalamocortical modules performing
confabulations, delivering excitation through
knowledge links, and applying skill
knowledge through the issuance of action
commands constitute the complete
foundation of all mammalian cognition.
Methodology and applications
One of the methodologies used for data completion is confabulation, which was explained above. The other key concept is cogency: the probability of occurrence of all the assumed fact words together with the assumed target word, i.e. p(αβγδ|ε).
A. Terminologies:
According to the confabulation model, assuming that the combined assumed facts αβγδ are true, the set of all symbols (in the answer lexicon from which conclusions are being sought) with p(αβγδ|ε) > 0 is called the expectation; its elements are termed answers, considered in descending order of their cogencies. Confabulation maximizes the product p(α|ε)·p(β|ε)·p(γ|ε)·p(δ|ε) (or, equivalently, the sum of the logarithms of these probabilities) as a surrogate for maximizing cogency. It is assumed that all required pairwise conditional probabilities p(Ψ|λ) between symbols Ψ and λ are known; this assumption is termed exhaustive knowledge. Each meaningful non-zero p(Ψ|λ) is termed an individual item of knowledge.
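To make this concrete, here is a minimal sketch (with hypothetical helper names, not the paper's implementation) of the two tasks described earlier: training a knowledge base of appearance counts from a toy corpus, and confabulating a conclusion word by maximizing the sum of log conditional probabilities over the context words:

```python
from collections import defaultdict
from math import log

def train(sentences, window=4):
    # Knowledge base: co-occurrence counts of context words with each
    # candidate target word, plus total counts per target.
    pair_counts = defaultdict(lambda: defaultdict(int))
    target_counts = defaultdict(int)
    for words in sentences:
        for i, target in enumerate(words):
            target_counts[target] += 1
            for ctx in words[max(0, i - window):i]:
                pair_counts[target][ctx] += 1
    return pair_counts, target_counts

def predict(context, pair_counts, target_counts):
    # Confabulation: choose the candidate maximizing the summed
    # log p(context word | candidate), a surrogate for cogency.
    best, best_score = None, float("-inf")
    for cand, ctx_counts in pair_counts.items():
        score = 0.0
        for ctx in context:
            if ctx_counts.get(ctx, 0) == 0:
                break  # candidate lacks this item of knowledge
            score += log(ctx_counts[ctx] / target_counts[cand])
        else:
            if score > best_score:
                best, best_score = cand, score
    return best

corpus = [["the", "cat", "sat", "on", "the", "mat"],
          ["the", "dog", "sat", "on", "the", "rug"]]
pairs, totals = train(corpus)
print(predict(["sat", "on", "the"], pairs, totals))  # -> "mat"
```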
B. Applications
These basic concepts of cognition and confabulation are applied in the artificial intelligence and machine learning fields to predict, fill in, and complete missing data in sentences. They can also be used for context representation, context exploitation, intelligent text recognition, and generating artificial stories.
Conclusion and Future Works
This paper is an initiative towards better data completion and manipulation. The major requirement for the process is a previously complete text corpus. Previous works fill in missing data based on Bayesian models, where words are selected by maximizing their posterior probability (selecting the conclusion which has the highest probability of being correct). Though Bayesian theory has been used extensively in neural networks, it is an awful model for cognition and data completion, because it simply selects the word with the highest probability value, even if that word is irrelevant to the given context. Many open source tools can be used for easy development. These basic concepts of cognition and confabulation are applied for context representation, context exploitation, intelligent text recognition, and generating artificial stories. The longer execution times due to lexicon overhead can be reduced by parallel processing. Applications that demand high throughput will have to evaluate the proposed confabulation method depending on the hardware available.
Computer chatbot 'Eugene Goostman' passes the Turing test
Eugene Goostman is a chatterbot, first developed by a group of three programmers (the Russian-born Vladimir Veselov, Ukrainian-born Eugene Demchenko, and Russian-born Sergey Ulasen) in Saint Petersburg in 2001. Goostman is portrayed as a 13-year-old Ukrainian boy, a trait intended to induce forgiveness in users for his grammar and level of knowledge. The Goostman bot has competed in a number of Turing test contests since its creation, and finished second in the 2005 and 2008 Loebner Prize contests. In June 2012, at an event marking what would have been the 100th birthday of its namesake, Alan Turing, Goostman won what was promoted as the largest-ever Turing test contest, successfully convincing 29% of its judges that it was human. On 7 June 2014, at a contest marking the 60th anniversary of Turing's death, 33% of the event's judges thought that Goostman was human; the event's organizer Kevin Warwick considered it to have "passed" Turing's test as a result, per Turing's prediction that by the year 2000, machines would be capable of fooling 30% of human judges after five minutes of questioning.
Malayalam POS Tagging Using Conditional
Random Fields
Archana T. C, Haritha Lakshmi, Krishnapriya, Sreesha, Jayasree N V
Dept. of Computer Science and Engg,
Sreepathy Institute of Management and Technology, Vavanoor
Abstract: Parts of speech tagging is the process of assigning tags to the words of a given sentence. This paper presents the building of a Part-Of-Speech (POS) tagger for the Malayalam language using Conditional Random Fields (CRF). A POS tagger plays an important role in natural language applications like speech recognition, natural language parsing, information retrieval and information extraction. The present tagset consists of 100 tags. The system includes a language model, trained on an annotated corpus of 3026 sentences (36,315 words), which checks the trigram probability of occurrence of tags in the training corpus. We present a trigram HMM-based (Hidden Markov Model) part-of-speech tagger for the Malayalam language, which accepts raw text and produces POS-tagged output. The accuracy of the system can be improved by increasing the size of the annotated corpus. Although the experiments were performed on a small corpus, the results show that the statistical approach works well with a highly agglutinative language like Malayalam.

I. Introduction
India is a large multilingual country of diverse culture. It has many languages with written forms and over a thousand spoken languages. The Constitution of India recognizes 22 languages, spoken in different parts of the country. The languages can be categorized into two major linguistic families, namely Indo-Aryan and Dravidian. These classes of languages have some important differences: their ways of forming words and their grammars differ. But both include a lot of Sanskrit words, and both have a similar construction and phraseology that links them together. There is a need to develop information processing tools to facilitate human-machine interaction in Indian languages, and multilingual knowledge resources. A POS tagger forms an integral part of any such processing tool. Parts of Speech tagging, a form of grammatical tagging, is the process of marking the words in a text as corresponding to a particular part of speech, based on their definition and context, i.e. their relationship with adjacent and related words in a phrase, sentence, or paragraph. This is the first step towards understanding any language. It finds its major applications in speech and
NLP, such as Speech Recognition, Speech Synthesis, Information Retrieval etc. A lot of work relating to this has been done in the NLP field, particularly in part-of-speech tagging of Western languages. These taggers vary in accuracy and also in their implementation. A lot of techniques have also been explored to make tagging more and more accurate, ranging from purely rule-based approaches to completely stochastic ones. Some of these taggers achieve good accuracy for certain languages. But unfortunately, not much work has been done with regard to Indian languages, especially Malayalam. In this paper we have developed a POS tagger based on Conditional Random Fields (CRF). Conditional Random Fields, undirected graphical models, are used to calculate the conditional probability of values on designated output nodes given values on other designated input nodes. It can be seen that the generative model (HMM) performs quite close to CRF. A Hidden Markov Model (HMM) is a statistical model in which the system modeled is assumed to be a Markov process with unknown parameters. The assumptions on which it works are that the probability of a word in a sequence may depend on the word immediately preceding it, and that both the observed and hidden words must form a sequence. It has been shown experimentally that the accuracy of the POS tagger can be improved significantly by introducing a trigram template, an efficient corpus and a widely accepted tagset. This paper mainly concentrates on designing a POS tagger using CRF++, an open source implementation of CRF.
System Architecture
The system consists of 3 modules
namely Preprocessing, Training and Testing.
The architecture of the proposed system is
depicted in Fig.1.
Fig 1. System Architecture
Preprocessing is the initial stage in the implementation of a Malayalam POS tagger. It takes the input sentence and tokenizes it: it receives the input as a sentence and splits it into words. The split words are stored in a file, which becomes the input for the testing phase.
The proposed method uses a supervised mode of learning for POS tagging. The simplest statistical approach finds the most frequently used tag for a specific word in the annotated training data and uses this information to tag that word in unannotated text; this is done during the training phase. Since we use a statistical approach, i.e. we train and test our model, we have to calculate the frequency and probability of the words of the given corpus.

During training, the annotated corpus is trained using the CRF template. All the files required for the procedure are explained with examples below. For training, the crf_learn command is used as shown below:
crf_learn template_file train_file model_file
The template file specifies the positions of the words used as features for tagging. If a unigram template is used, a word is tagged based on the current word alone. If bigram statistics are used, the tagging is done based on the current and previous words. On increasing the number of grams, the template becomes more expressive and the tagging more effective. The template file can be represented as shown in Fig. 2.
Fig. 2 CRF Template
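For illustration, a minimal template of the kind Fig. 2 depicts might look as follows; this is a generic sketch, not necessarily the authors' exact feature set. In CRF++ template files, `U` lines declare unigram feature templates, `%x[row,col]` refers to the token at a relative row offset, and a lone `B` adds bigram features over adjacent output tags:

```
U00:%x[-1,0]
U01:%x[0,0]
U02:%x[1,0]
U03:%x[-1,0]/%x[0,0]
B
```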
The tagset used in the system was developed by the Bureau of Indian Standards. A tagset is a list of short forms representing the components of a sentence, like nouns, verbs and their sub-forms. The corpus training is performed with the help of the tagset, which in this system contains 100 tags. Training creates a model file as output, which contains the learned probabilities of the corpus.
![Page 20: Clear june 2014](https://reader034.fdocuments.net/reader034/viewer/2022052305/568c38451a28ab02359e5ef6/html5/thumbnails/20.jpg)
CLEAR June 2014 20
Fig.3 Training Corpus
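CRF++ expects the training corpus as one token per line, with whitespace-separated columns, the tag in the last column, and blank lines separating sentences. A schematic sketch of the format (English placeholders standing in for Malayalam tokens, with BIS-style tags):

```
John      N_NNP
reads     V_VM
books     N_NN

She       PR_PRP
writes    V_VM
```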
Testing is the process of comparing the trained model file with the tokenized input file and, obeying the rules in the template and the tags in the tagset, finding the corresponding tag of each and every word. The crf_test command is used:

crf_test -m model_file test_file
A screenshot of the proposed system
is given in Fig.4.
Fig 4: Screenshot
Conclusion
The motivation of this project is to help children and foreigners learn the structure of a Malayalam sentence. The application is built with the Python programming language as the development environment, keeping in mind design standards and the maintainability of the code. CRF++, together with the Hidden Markov Model and Tkinter features, provides a rich user experience to the users of the software. This application is very simple to use and is helpful to people who are in the preliminary stages of learning the Malayalam language.
Making the world’s knowledge computable
Wolfram Alpha introduces a fundamentally new way to get knowledge and answers: not by searching the web, but by doing dynamic computations based on a vast collection of built-in data, algorithms, and methods.
http://www.wolframalpha.com/
Topic modelling and LDA algorithm
Indu M
Dept. of Computer Science, Govt. Engineering College,
Sreekrishnapuram
Introduction
As more and more electronic documents become available, it becomes harder to find and discover what we need. We need new tools to help us organize, search, and understand these vast amounts of information.
Topic models are based upon the idea that documents are mixtures of topics, where a topic is a probability distribution over words. A topic model is a generative model for documents: it specifies a simple probabilistic procedure by which documents can be generated. To make a new document, one chooses a distribution over topics. Then, for each word in that document, one chooses a topic at random according to this distribution, and draws a word from that topic.
Probabilistic topic models are a suite of algorithms whose aim is to discover the hidden thematic structure in large archives of documents. One of the simplest topic models is latent Dirichlet allocation (LDA). The intuition behind LDA is that documents exhibit multiple topics. Most topic models, such as LDA [1], are unsupervised: only the words in the documents are modeled, and the goal is to infer topics that maximize the likelihood (or the posterior probability) of the collection.
The main application of topic modeling is in Information Extraction. It can also be used to analyze, summarize, and categorize streams of text data as they arrive. For example, as news arrives in streams, organizing it into threads of relevant articles is more efficient and convenient.
A topic model automatically captures the thematic patterns in a text stream, identifies emerging topics, and tracks their changes over time. It can be used to check models, summarize the corpus, and guide exploration of its contents. Topic modelling can also enhance information network construction by grouping similar objects, event types and roles together.

Topic modelling programs do not know anything about the meaning of the words in a text. Instead, they assume that any piece of text is composed (by an author) by
selecting words from possible baskets of
words where each basket corresponds to a
topic. If that is true, then it becomes possible
to mathematically decompose a text into the
probable baskets from whence the words
first came.
To make a new document in topic modelling, one chooses a distribution over topics; then, for each word in that document, one chooses a topic at random according to this distribution and draws a word from that topic. The model specifies the following distribution over the words within a document:

p(w) = Σ_{z=1..T} p(w|z) p(z)

where T is the number of topics, p(z) is the distribution over topics z in a particular document, and p(w|z) is the probability distribution over words w given topic z.
LDA
Let us look at the basic ideas behind latent Dirichlet allocation (LDA), the simplest topic model [3]. The intuition behind LDA is that documents exhibit multiple topics. For example, consider the article in Figure 1 [1]. This article, entitled "Seeking Life's Bare (Genetic) Necessities", is about using data analysis to determine the number of genes that an organism needs to survive.

The figure highlights the different words that are used in the article. Words about data analysis, such as "computer" and "prediction", are highlighted in blue; words about evolutionary biology, such as "life" and "organism", are highlighted in pink; words about genetics, such as "sequenced" and "genes", are highlighted in yellow. At the left of the figure are a number of topics. Each document is assumed to be generated as follows: first choose a distribution over the topics (the histogram at right); then, for each word, choose a topic assignment (the coloured coins) and choose the word from the corresponding topic.

LDA models can be used to find topics that describe a corpus, where each document exhibits multiple topics.
Figure: Graphical model of LDA
Methodology and applications
One of the methodologies used for topic detection is LDA, explained above. Theoretical studies of topic modelling focus on learning the model's parameters assuming the data is actually generated from it. Existing approaches for the most part rely on Singular Value Decomposition (SVD), and consequently have one of two limitations: they need either to assume that each document contains only one topic, or else they can only recover the span of the topic vectors instead of the topic vectors themselves.

Other probabilistic models, such as naive-Bayes Dirichlet Compound Multinomial (DCM) mixtures and probabilistic Latent Semantic Indexing (pLSI), have also been used to find relevant topics. We can also simplify the topic distribution by modelling each topic as a discrete probability distribution over documents.
A. Algorithm

The corpus contains a collection of documents. For each document in the collection, we generate the words in a two-stage process:

1. Randomly choose a distribution over topics.
2. For each word in the document:
   (a) Randomly choose a topic from the distribution over topics in step #1.
   (b) Randomly choose a word from the corresponding distribution over the vocabulary.

A sketch of this generative process appears below.
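As an illustration (not from the article), a minimal sketch of the two-stage generative process, with hypothetical topic-word distributions over a toy vocabulary:

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["gene", "dna", "computer", "data", "life", "organism"]
# Hypothetical per-topic word distributions (each row sums to 1).
topics = np.array([
    [0.4, 0.4, 0.0, 0.0, 0.1, 0.1],    # "genetics"
    [0.0, 0.1, 0.45, 0.45, 0.0, 0.0],  # "data analysis"
    [0.1, 0.0, 0.0, 0.0, 0.45, 0.45],  # "evolutionary biology"
])

def generate_document(n_words, alpha=0.5):
    # Step 1: draw the document's distribution over topics
    # (a Dirichlet draw, as in LDA).
    theta = rng.dirichlet([alpha] * len(topics))
    words = []
    for _ in range(n_words):
        z = rng.choice(len(topics), p=theta)     # step 2(a): pick a topic
        w = rng.choice(len(vocab), p=topics[z])  # step 2(b): pick a word
        words.append(vocab[w])
    return theta, words

theta, doc = generate_document(10)
print(np.round(theta, 2), doc)
```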
This statistical model reflects the intuition that documents exhibit multiple topics. Each document exhibits the topics in different proportions (step #1), and each word in each document is drawn from one of the topics (step #2b), where the selected topic is chosen from the per-document distribution.

In order to evaluate the predictive power of a generative model on unseen data, there is a standard measure known as perplexity. With it, one is even able to relate words from different languages.
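As a sketch of the standard definition (not from the article), perplexity is the exponentiated negative average log-likelihood of the held-out words; lower is better:

```python
import math

def perplexity(log_probs):
    # log_probs: natural-log probabilities the model assigns to each
    # held-out word.
    return math.exp(-sum(log_probs) / len(log_probs))

print(perplexity([math.log(0.1)] * 50))  # uniform 0.1 per word -> 10.0
```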
B. Applications
Topic models are good for data exploration, when there is some new data set and you don't know what kinds of structures you could possibly find in there. If you do know what structures you could find in your data set, topic models are still useful when you don't have the time or resources to construct classification models based on supervised machine learning. Lastly, even if you do have the time and resources to construct classification models based on supervised learning, topic models are still useful as extra features to add to the models in order to increase their accuracy. This is the case because topic models act as a kind of "smoothing" that helps combat the sparse data problem often seen in supervised learning.
Topic modeling can be used in computer vision; as an inference algorithm applied to natural texts in the service of text retrieval, classification, organization, and building text hierarchies; in WSD and machine learning; to organize, summarize, and help users explore large corpora; in information engineering applications; and in scientific applications such as genetics and neuroscience.
Conclusion and Future Works
Documents are partitioned into topics, which in turn have terms associated with them to varying degrees. In practice, however, there are some clear issues: the models are very sensitive to the input data, and small changes to the stemming/tokenization algorithm can result in completely different topics; topics need to be manually categorized in order to be useful; and topics are "unstable", in the sense that adding new documents can cause significant changes to the topic distribution.
REFERENCES
[1] D. Blei, "Introduction to Probabilistic Topic Models", Communications of the ACM, pp. 77-84, 2012.

[2] Vivi Nastase, "Introduction to Topic Models: Building up towards LDA", summer semester 2012.

[3] D. Blei et al., "Supervised Topic Models", Princeton University.

[4] David Hall et al., "Studying the History of Ideas Using Topic Models", Stanford University.
Tamil to Malayalam Transliteration
Kavitha Raju, Sreerekha T V, Vidya P V
M.Tech CL
GEC, Sreekrishnapuram
Palakkad
Abstract: Transliteration can form an essential part of transcription, which converts text from one writing system to another. This article discusses the applications of and challenges in machine transliteration from Tamil to Malayalam, two languages that belong to the Dravidian family. Transliteration can be used to supplement the machine translation process by handling issues that can arise due to the presence of named entities.

I. Introduction
The rewriting or conversion of the characters of a text from one writing system to another is called transliteration. Here each character of the source language is assigned a different, unique character of the target language, so that an exact inversion is possible. If the source language has more characters than the target language, combinations of characters and diacritics can be used.
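As a toy illustration of such a character-level mapping (not the article's system): the Unicode blocks for Tamil (U+0B80-U+0BFF) and Malayalam (U+0D00-U+0D7F) share a common ISCII-derived layout, so a naive transliterator can shift code points by a fixed offset. A real system must additionally handle the one-to-many mappings discussed below.

```python
# Naive code-point-shift sketch; assumes the shared ISCII-derived layout
# of Unicode Indic blocks and ignores one-to-many correspondences.
TAMIL_START, TAMIL_END = 0x0B80, 0x0BFF
OFFSET = 0x0D00 - 0x0B80  # distance between the Tamil and Malayalam blocks

def tamil_to_malayalam(text: str) -> str:
    out = []
    for ch in text:
        cp = ord(ch)
        if TAMIL_START <= cp <= TAMIL_END:
            out.append(chr(cp + OFFSET))  # shift into the Malayalam block
        else:
            out.append(ch)                # leave other characters alone
    return "".join(out)

print(tamil_to_malayalam("தமிழ்"))  # -> "തമിഴ്"
```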
Machine transliteration systems have great importance in a country like India, which has a fascinating diversity of languages. Even though there are groups of languages that come from a common origin, the difference in scripts makes the task cumbersome.

Machine transliteration systems can be classified into rule-based and corpus-based approaches, in terms of their core methodology. Rule-based systems develop linguistic rules that allow the words to be put in different places, to have different scripts depending on context, etc. The main approach of these systems is based on linking the structure of the given input with the structure of the demanded output, necessarily preserving their unique meaning. Minimally, to get a transliteration of a source language sentence, one needs a dictionary that will map each source language word to an appropriate target language word, rules representing source language word structure, and rules representing target language word structure. Because rule-based machine transliteration uses linguistic information to mathematically break down the source and target languages, it is more predictable and grammatically superior to the statistical method. Statistical machine transliteration employs a statistical model, based on the analysis of a corpus, to generate text in the target language. The idea behind statistical machine transliteration comes from information theory: a document is transliterated according to the probability distribution p(e|f) that a string e in the target language is the transliteration of a string f in the source language.
There are two primary motivations behind pursuing this project. The first is that in India, the majority of the population still uses their mother tongue as the medium of communication; the second is that, in spite of globalization and the widespread influence of the West in India, most people still prefer to read, write and talk in their mother tongue.
II. Language Features
Malayalam is a language spoken in India,
predominantly in the state of Kerala. It is one
of the 22 scheduled languages of India and
was designated a Classical Language in India
in 2013. Malayalam has official language
status in the state of Kerala and in the union
territories of Lakshadweep and Puducherry.
It belongs to the Dravidian family of
languages. The origin of Malayalam,
whether it was from a dialect of Tamil or an
independent offshoot of the Proto Dravidian
language, has been and continues to be an
engaging pursuit among comparative
historical linguists.
Tamil is a Dravidian language spoken
predominantly by Tamil people of South
India and North-east Sri Lanka. It has official
status in the Indian states of Tamil Nadu,
Puducherry and Andaman and Nicobar
Islands. Tamil is also an official language of
Sri Lanka and an official language of
Singapore. It is recognised as one of the languages of the medium of education in Malaysia, along with English, Malay and Mandarin. It is also spoken as a secondary language in the states of Kerala, Karnataka and Andhra Pradesh and in the Andaman and Nicobar Islands. It is one of the 22
scheduled languages of India and was the
first Indian language to be declared a
classical language by the Government of
India in 2004.
Both Tamil and Malayalam are languages of the southern states of India. Even though Malayalam is said to belong to the Dravidian family of languages, it is more similar to the Arya languages. When the Aryans came to the north-east border of Bharatam, the people of Harappa and Mohenjo-daro moved east and south and replanted their civilisation in South India, based in Tamil Nadu. Tamil is one of the oldest languages. When the Dravidians moved south, there were people of the soil to receive and help them, and the geography helped them to preserve Tamil. The Dravidians settled in Tamil Nadu and developed Tamil literature and Tamil civilization.
In comparison to other Indian languages, Tamil has only 12 vowels and 18 consonants. Malayalam is one of the most updated languages, with clarity in voice, and is comparatively more difficult to study: it has 56 letters, and its vowels most closely resemble those of Sanskrit. Since Tamil has only a few alphabets in comparison to Malayalam, it is not possible to have a one-to-one mapping between these languages; a letter in Tamil can be transliterated as more than one letter in Malayalam.
III. Applications and Challenges
The various applications of machine
transliteration system are as follows:
• It aids machine translation.
• It helps to eliminate language barriers.
• It supports localization.
• It enables the use of a keyboard in a
given script to type in a text in another one.
Transliteration also has many challenges.
They are:
• Dissimilar alphabet sets of the source
and target languages.
• Multiple possible transliterations for
the same word.
• Finding exactly matching tokens is
difficult for some of the vowels and a few
consonants.
• The size of the corpus required is
very large in order to build an accurate
transliteration system.
IV. Conclusion
In this article, we discussed Tamil to Malayalam transliteration systems in general. Various applications of transliteration and the challenges associated with it were also pointed out. Transliteration is indispensable for a machine translation system that must handle named entities; in the case of dictionary-based translation systems, it is very useful, as it saves a lot of time and resources. But it has been observed that a large corpus is required to model the system accurately.
Memory-Based Language Processing And
Machine Translation
Nisha M
M.Tech CL
GEC, Sreekrishnapuram
Palakkad
Abstract: Memory-based language processing (MBLP) is an approach to language processing based on exemplar storage during learning and analogical reasoning during processing. From a cognitive perspective, the approach is attractive because it does not make any assumptions about the way abstractions are shaped, and does not make any a priori distinction between regular and exceptional exemplars, allowing it to explain the fluidity of linguistic categories, and irregularization as well as regularization in processing. Memory-based machine translation can be considered a form of Example-Based Machine Translation. The machine translation problem can be treated as a classification problem, and hence memory-based learning can be applied. This paper demonstrates a memory-based approach to machine translation.

I. Introduction
Memory-based language processing (MBLP) is based on the idea that learning and processing are two sides of the same coin. Learning is the storage of examples in memory, and processing is similarity-based reasoning with these stored examples.

MBLP finds its computational basis in the classic k-nearest neighbor classifier (Cover & Hart, 1967). With k = 1, the classifier searches for the single example in memory that is most similar to B, say A, and then copies its memorized mapping A' to B' (as visualized schematically in Figure 1). With k set to higher values, the k nearest neighbors to B are retrieved, and some voting procedure (such as majority voting) determines which value is copied to B'.
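For concreteness, a minimal sketch (not the authors' code) of this scheme over symbolic feature vectors, using the simple overlap distance common in memory-based learning and majority voting for k > 1:

```python
from collections import Counter

def overlap_distance(a, b):
    # Number of feature positions at which two examples disagree.
    return sum(1 for x, y in zip(a, b) if x != y)

class MemoryBasedClassifier:
    def __init__(self, k=1):
        self.k = k
        self.memory = []  # stored (features, label) examples

    def train(self, examples):
        # Learning is just storage of examples in memory.
        self.memory.extend(examples)

    def classify(self, features):
        # Retrieve the k most similar stored examples and vote.
        nearest = sorted(self.memory,
                         key=lambda ex: overlap_distance(ex[0], features))[:self.k]
        votes = Counter(label for _, label in nearest)
        return votes.most_common(1)[0][0]

clf = MemoryBasedClassifier(k=3)
clf.train([(("the", "cat", "sat"), "A"),
           (("the", "dog", "sat"), "A"),
           (("a", "cat", "runs"), "B")])
print(clf.classify(("the", "cat", "runs")))  # majority vote -> "A"
```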
Memory-based machine translation can be considered a form of Example-Based Machine Translation. In characterizing Example-Based Machine Translation, Somers (1999) refers to the
common use of a corpus or database of already translated examples, and the process of matching new input against instances in this database. This matching is followed by the extraction of fragments, which are then recombined to form the final translation.

The task of mapping one language to another can be treated as a classification problem. The method can be described as one in which the sentence to be translated is decomposed into smaller fragments, each of which is passed to a classifier for classification. The memory-based classifier is trained on the basis of a parallel corpus in which the sentence pairs have been reduced to smaller fragment pairs. The assigned classes thus correspond to the translations of the fragments. In the final step, these translated fragments are re-assembled to derive the final translation of the input sentence.
II. Background and Related Literature
Natural language processing models and systems typically employ abstract linguistic representations (syntactic, semantic, or pragmatic) as intermediate working units. Memory-based models enable us to ask whether we can do without them, since any invented intermediate structure is always implicitly encoded somehow in the words at the surface and the way they are ordered, and memory-based models may be capable of capturing the knowledge that is usually considered necessary, in an implicit way, so that it does not need to be explicitly computed.

Classes of natural language processing tasks in which this question can be investigated in the extreme are processes in which form is mapped to form, i.e., in which neither the input nor the output contains abstract elements to begin with, such as machine translation. Many current machine translation tools, such as the open source Moses toolkit (Koehn et al., 2007), indeed implement a direct mapping of source to target text, leaving all of syntax and semantics implicit; these hide in the form of statistical translation models between collocationally strong phrases, and of statistical language models of the target language. The MBLP approach to this problem involves using context on the source side, and using memory-based classification as a translation model (Van Gompel, Van den Bosch, & Berck, 2009).
An encouraging number of
recent studies attempt to link statistical
and memory-based models of language that
discover strong n-grams (for phrase-based
statistical machine translation or for
statistical language modeling) to the concept
of constructions, and to the question of the
extent to which human language users
exploit constructions. To mention two: Mos,
Van den Bosch, and Berck report that a
memory-based language model correlates
reasonably with the unit segmentations that
test subjects produce in a sentence-copy
task; the model implicitly captures several
strong complex lexical items
(constructions), although it fails to capture
long-distance dependencies, a common issue
with local n-gram-based statistical models.
In a related study, Arnon and Snider (2010)
show that subjects are sensitive to the
frequency (a rough approximation of
collocational strength) of four-word n-grams
such as 'don't have to worry', which are
processed faster when they are more
frequent. Their argument again centres on
whether strong subsequences need
hierarchical linguistic structure or can
simply be taken to be flat n-grams; it is
exactly this question that we aim to explore
further in our work with memory-based
language processing models.
III. Methodology
The process of translating a new
sentence is divided into a local phase
(corresponding to the first two steps in the
process) in which memory-based
translation of source trigrams to target
trigrams takes place, and a global phase
(corresponding to the third step) in
which a translation of a sentence is
assembled from the local predictions.
A. Local classification

Both in training and in actual
translation, a new sentence in the source
language is first converted into windowed
trigrams, where each token is taken as the
center of a trigram once. The first trigram of
the sentence contains an empty left element,
and the last trigram an empty right element.
At training time, each source-language
sentence is accompanied by a target-language
translation. Word alignment must be
performed before this step, so that the
classifier knows for each source word
whether it maps to a target word, and if so,
to which. Given the alignment, each source
trigram is mapped to a target trigram whose
middle word is the target word to which the
middle word of the source trigram aligns,
and whose neighboring words are
the center word's actual neighbors in the
target sentence.
Figure 2: An example training pair of
sentences, converted into six overlapping
trigrams with their aligned trigram
translations [1].
When translating new text, trigram outputs
are generated for all words in each new
source language sentence to be translated,
since our system does not have clues as to
which words would be aligned by statistical
word alignment.
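A minimal Python sketch of this windowing step (our illustration; the underscore boundary marker is an assumption) could look like this:

```python
def windowed_trigrams(tokens, empty="_"):
    """Convert a tokenized sentence into overlapping trigrams, taking
    each token as the centre of a trigram once. The first trigram gets
    an empty left element and the last an empty right element."""
    padded = [empty] + list(tokens) + [empty]
    return [tuple(padded[i - 1:i + 2]) for i in range(1, len(padded) - 1)]

print(windowed_trigrams("I will come to your party".split()))
# [('_', 'I', 'will'), ('I', 'will', 'come'), ..., ('your', 'party', '_')]
```

A six-word sentence yields six overlapping trigrams, matching the example of Figure 2.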
B. Global search

To convert the set of generated target
trigrams into a full sentence translation, the
overlap between the predicted trigrams is
exploited. Figure 3 illustrates a perfect case
of a resolution of the overlap (drawing on
the example of Figure 2), causing words in
the English sentence to change position with
respect to their aligned Dutch counterparts.
The first three English trigrams align
one-to-one with the first three Dutch words.
The fourth predicted English trigram,
however, overlaps to its left with the fifth
predicted trigram in one position, and
overlaps in two positions to the right with
the sixth predicted trigram, suggesting that
this part of the English sentence is
positioned at the end. Note that in this
example, the "fertility" words take and this,
which are not aligned in the training trigram
mappings (cf. Figure 2), play key roles in
establishing trigram overlap.
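The overlap resolution itself can be approximated greedily. The sketch below is a simplification we add for illustration, not the paper's algorithm; it handles only clean one- and two-position overlaps and reuses the `_` boundary marker from the previous sketch.

```python
def overlap(prefix, trigram):
    """Size of the largest overlap (2 or 1 positions) between the end
    of the partial sentence and the start of a predicted trigram,
    ignoring overlaps that involve the empty boundary marker."""
    for n in (2, 1):
        if tuple(prefix[-n:]) == trigram[:n] and "_" not in trigram[:n]:
            return n
    return 0

def stitch(trigrams):
    """Greedily assemble a sentence: start from the first predicted
    trigram and repeatedly attach the remaining trigram with the best
    overlap, appending only its non-overlapping tail."""
    remaining = list(trigrams)
    sentence = list(remaining.pop(0))
    while remaining:
        best = max(remaining, key=lambda t: overlap(sentence, t))
        n = overlap(sentence, best)
        sentence.extend(best[n:])
        remaining.remove(best)
    return [w for w in sentence if w != "_"]

trigrams = [("_", "the", "dog"), ("the", "dog", "barks"), ("dog", "barks", "_")]
print(stitch(trigrams))  # ['the', 'dog', 'barks']
```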
IV. Conclusion and Future Works

The study described in this paper has
demonstrated how memory-based learning
can be applied to machine translation.
Because memory-based learning stores all
examples in memory, it settles for a state of
maximal description length. This extreme
bias makes memory-based learning an
interesting comparative case against
so-called eager learners, such as
decision-tree induction algorithms and rule
learners. Phrase-based
memory-based machine translation has also
been implemented, using statistical toolkits
such as Moses for phrase extraction. Future
work includes developing memory-based
techniques for phrase extraction.
REFERENCES

[1] A. van den Bosch and P. Berck. Memory-based machine translation and language modeling. The Prague Bulletin of Mathematical Linguistics, 91:17–26, 2009.

[2] Antal van den Bosch and Walter Daelemans. Memory-Based Language Processing. Cambridge University Press, New York, 2005.

[3] Antal van den Bosch and Walter Daelemans. Implicit schemata and categories in memory-based language processing.
The Stanford Natural Language Processing Group
The Natural Language Processing Group at Stanford University is a team of faculty, research
scientists, postdocs, programmers and students who work together on algorithms that allow
computers to process and understand human languages. The work ranges from basic research in
computational linguistics to key applications in human language technology, and covers areas such
as sentence understanding, machine translation, probabilistic parsing and tagging, biomedical
information extraction, grammar induction, word sense disambiguation, and automatic question
answering. The Stanford NLP Group makes parts of their Natural Language Processing software
available to everyone: statistical NLP toolkits for various major computational linguistics
problems, which can be incorporated into applications with human language technology needs. A
distinguishing feature of the Stanford NLP Group is their effective combination of sophisticated,
deep linguistic modeling and data analysis with innovative probabilistic and machine learning
approaches to NLP. Their research has resulted in state-of-the-art technology for robust, broad-
coverage natural language processing in many languages. These technologies include their
competition-winning coreference resolution system; a state-of-the-art part-of-speech tagger; a
high-performance probabilistic parser; a competition-winning biological named entity recognition
system; and algorithms for processing Arabic, Chinese, and German text. All the software is written
in Java, and all recent distributions require Oracle Java 6+ or OpenJDK 7+. Distribution packages include
components for command-line invocation, jar files, a Java API, and source code. A number of
helpful people have extended this work with bindings or translations for other languages, so
much of the software can also easily be used from Python (or Jython), Ruby, Perl, JavaScript,
and F# or other .NET languages.
Link: http://nlp.stanford.edu/software/
Dialect Resolution

Manu V. Nair, Sarath K. S
Department of Computer Science and Engineering
Govt. Engg. College Sreekrishnapuram, Palakkad, India - 678 633

Abstract: A dialect is a regional or social variety of a language distinguished by pronunciation, grammar, or vocabulary, especially a way of speaking that differs from the standard variety of the language. It is a recognized and formal variant of the language spoken by a large group of one region, class or profession. Slang, by contrast, consists of a lexicon of non-standard words and phrases in a given language. Dialect resolution is an approach to mapping a dialect dialog or word into its formal form without losing its meaning. It is a localized approach through which a local person can express his ideas in his own style and have them rendered in a formal format. It can also resolve slang words.

I. Introduction
India is a special nation, with a
highly linguistically diverse population, 18
officially recognized languages, and many
other unofficial languages. Its diversity
exceeds that of China (7 major languages
and hundreds of dialects), even though India
covers only about one third of China's area.
A dialect is a form of a language
spoken in a particular geographical area or
by members of a particular social class or
occupational group, distinguished by its
vocabulary, grammar, and pronunciation.
The term is applied most often to regional
speech patterns, but a dialect may also be
defined by other factors, such as social class.
A dialect associated with a particular
social class can be termed a sociolect, one
associated with a particular ethnic group an
ethnolect, and a regional dialect a
regiolect or topolect. According to this
definition, any variety of a language
constitutes "a dialect", including any
standard varieties. A standard dialect is a
dialect that is supported by institutions.
Slang consists of words, expressions,
and meanings that are informal, and are used
by people who know each other very well or
who have the same interests. It mostly
includes expressions that are not considered
appropriate for formal occasions, and is
often vituperative or vulgar. Use of these words
and phrases is typically associated with the
subversion of a standard variety and is likely
to be interpreted by listeners as implying
particular attitudes on the part of the speaker.
In some contexts a speaker's selection of
slang words or phrases may convey prestige,
indicating group membership or
distinguishing group members from those
who are not a part of the group.
Among Indian languages, Malayalam
is a highly inflectional language. Political
and geographical isolation, the impact of
Christianity and Islam, and the arrival of the
Namboothiri Brahmins a little over a
thousand years ago all created conditions
favourable to the development of Malayalam
as a distinct local language. The
Namboothiris grafted a good deal of Sanskrit
onto the local dialect and influenced its
physiognomy. Malayalam itself contains
many dialect variations; each district of
Kerala has a dialect of its own.
II. Dialect Resolution
Dialect resolution is an approach to
mapping a dialect dialog or word into its
formal form without losing its meaning. It is
a localized approach through which a local
person can express his ideas in his own style
and have them rendered in a formal format.
Highly informal slang words are also to be
resolved.

From a computational point of view,
dialect resolution is a difficult task, since
there are different dialect variations even
within a single language. Computational
methods for dialect resolution include
rule-based, statistical, and machine learning
approaches. A hybrid approach, i.e. a
mixture of the above approaches, can also be
used; since it combines the strengths of the
individual approaches, it may achieve higher
accuracy.
III. Applications
A. Localization
Language localization is the process
of adapting a product that has been
previously translated into different
languages to a specific country or region.
Localization can be done for regions or
countries where people speak different
languages or where the same language is
spoken.

Dialect resolution has an important
role in localization. Since Malayalam, a
morphologically rich language, has many
types of dialects, dialect resolution has to be
incorporated into the localization process.
Tribal communities can then express their
ideas and knowledge to the outside world in
their own language; the dialect resolution
system will convert it into the formal
language to which it belongs. This will
encourage them to engage with the outside
world, especially with the government
processes and services aimed at them, and
they will feel free to use their mother tongue.
B. Machine Translation
Machine Translation is the sub-field
of computational linguistics that
investigates the use of software to
translate text or speech from one natural
language to another.
Dialect resolution is very useful
when a story written in a colloquial
variety is translated to another language.
First, the story is passed through the
dialect resolution process to put it into
the standard language; then it is given to
the machine translation system. Without
a dialect resolution system it is difficult
to deal with such colloquial works. A
similar process can be applied to old
transcripts, which may contain details of
rare medicines, histories, and so on; this
information can be translated to another
language only with the support of a
dialect resolving system.
C. Speech to text applications
Dialect resolution is also crucial
for speech-to-text applications. Even
when speech is transcribed with good
accuracy, the resulting text may contain
dialect words, which must be resolved
before the text can be used as formal
language. This is especially useful when
text in a local variety has to be converted
into the formal language and, through
that, into other languages, for example in
parliaments and legislative assemblies.
IV. Issues in Dialect Resolution
As already noted, dialect resolution is not
an easy task; it raises several issues at
implementation time.
A. Input Level
The Thrissur dialect ranges from
simple utterances to complex ones
with compound and slang words.
Inputs such as sentences containing
metaphorical expressions, named
entities, large compound words, or
redundant words are harder to
resolve together, and each of them
should be handled separately.
Malayalam is a highly agglutinative
language with free word order. Many
slang words change at the tail, and the
remaining part, together with its context
words, provides a clue to the actual word.
B. Machine Learning Level
The dialect patterns are
learned through a learning process.
The smaller the training corpus, the
lower the accuracy on unknown
items. The choice of machine
learning method is also a crucial
factor, since different learners
perform differently on the same
corpus. Obtaining a feature-annotated
corpus is also a big task, and named
entities in the training data can
mislead the learner; they have to be
transliterated.
C. Corpus Level
The availability of a very large
corpus of formal Malayalam
sentences is a big issue, as is the lack
of a dictionary covering all slang words.
D. Output Level
Keeping the output natural is a
great challenge. Handling ambiguous
words requires context information.
And even though Malayalam is a
language with highly free word
order, generating semantically
correct sentences requires proper
word ordering.
V. Conclusion & Future Works

Dialect resolution is a significant step
towards localized language processing for
Malayalam. The ultimate goal is to bring all
dialectal ideas and information into a formal
format that is readable and understandable to
others, removing dialect boundaries within
the language; translation into other
languages then becomes much easier.

Dialect resolution for the Thrissur
dialect can be adapted to other dialects and
other languages, given sufficient corpus
support. As an extension of this approach,
gender, tense, person and number
information can be labeled for
disambiguation. Results will improve with a
large corpus of formal sentences and a
dictionary covering all slang words; machine
learning techniques such as TnT, SVM or
CRF trained on such a corpus will provide
better results.
VI. REFERENCES
[1] Daniel Jurafsky and James H. Martin. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition. Prentice Hall, 1999.

[2] Thorsten Brants. TnT – A Statistical Part-of-Speech Tagger. Saarland University, Computational Linguistics, 2000.
M.Tech Computational Linguistics
Department of Computer Science and Engineering
2012-2014 Batch
Details of Master Research Projects
Title Spoken Language Identification
Name of
Student Abitha Anto
Abstract Spoken Language Identification (LID) refers to the automatic process through
which the identity of the language spoken in a speech sample is determined. This
project is based on the phonotactic approach. The phone/phoneme sets differ
from one language to another, as do the frequencies of occurrence of certain
phones/phonemes. Based on the phone sequences of each language, a language-
dependent n-gram probability distribution model is estimated. Language
identification is done by comparing the frequency of occurrence of certain
phones or phone sequences with that of the target languages. The applications of
LID systems fall into two categories: preprocessing for machine understanding
systems and preprocessing for human understanding systems. This project tries
to identify Indian languages such as English (Indian), Hindi, Malayalam, Tamil
and Kannada.
Tools HTK, SRILM, Matlab
Place of
Work
Amrita Vishwa Vidyapeetham, Coimbatore.
Title Statistical Approach to Anaphora Resolution for Improved Summary Generation
Name of
Student Ancy Antony
Abstract Anaphora resolution deals with finding the noun phrase antecedents of
anaphors. It is used in natural language processing applications such as text
summarization, information extraction, question answering, machine translation,
and topic identification. The project aims to resolve the most frequently occurring
pronouns in documents, such as third person pronouns. A statistical approach is
utilized: important features are extracted and used for finding the antecedents of
anaphors. The proposed system includes components such as a pronoun extractor,
noun phrase extractor, gender classifier, subject identifier, named entity
recognizer, chunker and part-of-speech tagger.
Tools NLTK, TnT, Stanford Parser
Place of
Work
GEC Sreekrishnapuram
Title Anaphora Resolution in Malayalam
Name of
Student Athira S
Abstract An anaphor is a linguistic entity which indicates a referential tie to some other
linguistic entity in the same text. Anaphora resolution is the process of
automatically finding the pairs of pronouns or noun phrases in a text that refer to
the same incidence, thing, person, etc., called the referent. The first member of the
pair is called the antecedent and the second the anaphor. This project tries to
resolve anaphora in Malayalam. We outline an algorithm for anaphora resolution
that works from the output of a Subject-Verb-Object tagger; Person-Number-
Gender agreement is also included in the existing tagging system. Anaphora
resolution is done based on the tagging and the degree of salience (salience value).
An anaphora resolution system can itself improve the performance of many NLP
applications such as text summarisation, term extraction and text categorisation.
Tools TnT
Place of
Work
IIITMK, Thiruvananthapuram, Kerala
Title Extractive News Summarization using Fuzzy Graph based Document Model
Name of
Student Deepa C A
Abstract This project describes a news summarization system based on fuzzy graph
document models. Modelling documents as fuzzy graphs is used to summarize a
set of similar newspaper articles. Each article is represented as a fuzzy graph
whose nodes represent sentences; an edge connects two nodes if there exists a
similarity between those sentences. The proposed extractive document
summarizer uses a fuzzy similarity measure to weight the edges.
Tools WebScrapy, NLTK
Place of
Work
Government Engineering College, Sreekrishnapuram
Title Text Summarization using Machine Learning Approach
Name of
Student Sreeja M
Abstract This project aims at comparing summarization algorithms on the DUC 2001
dataset using the ROUGE toolkit, developing a new summarization algorithm,
and comparing it with previous work.
Tools Eclipse, Rouge, R-Studio
Place of
Work
Centre for Artificial Intelligence and Robotics , DRDO, Bangalore
Title Fuzzy model-based emotion and sentiment analysis of text documents
Name of
Student Divya M
Abstract Computer systems are inevitable in almost all aspects of everyday life. With the
growth of artificial intelligence, the capabilities and functionalities of computer
systems have been enhanced. Emotions constitute a key factor in human
communication. Human emotion can be expressed through various mediums such
as speech, facial expressions, gestures and textual data. Emotion and sentiment
analysis of text is a growing area of interest in the field of computational
linguistics. Emotion detection approaches use or modify concepts and general
algorithms created for subjectivity and sentiment analysis. In this project the
emotion and sentiment of text is analysed by means of a fuzzy approach. The
proposed method involves the construction of a knowledge base of words known
to have emotional content, representing six basic emotions: anger, fear, joy,
sadness, surprise and disgust. The system takes natural language sentences as
input, analyses them and determines the underlying emotion. It also represents
multiple emotions contained in a text document. Experimental results indicate
quite satisfactory performance.
Tools NLTK, Stanford Parser
Place of
Work
Government Engineering College, Sreekrishnapuram
Title Ontology based information retrieval system in legal documents
Name of
Student Gopalakrishnan. G
Abstract An ontology serves as a knowledge base for a domain, used by agents to
mine relationships and dependencies and/or to answer user queries. The domain
of focus should possess a valid structure and hierarchy. The Indian Penal Code (IPC) is
one such realm: the apex criminal code of India, codified into five
hundred and eleven sections under twenty-three chapters. An ontology for the IPC
creates a vista that enables legal professionals as well as the common man to access the
intended sections of the code in the easiest way. The ontology also provides the
judgments produced over each section, which makes the IPC more transparent and
closer to the people. Protege and OWL are used to develop the ontology. Once
completed, the ontology serves as an integral reference point for the legal
community of the country, and can be applied to information retrieval, decision
support/making, agent technology and question answering.
Tools Protege, Apache Jena, Python regular expression
Place of
Work
Government Engineering College, Sreekrishnapuram
Title Topic Detection using LDA algorithm
Name of
Student Indu M
Abstract Probabilistic topic models are a suite of algorithms whose aim is to discover the
hidden thematic structure in large archives of documents. Topic modelling can
enhance information network construction by grouping similar objects, event
types and roles together. Each document is considered a distribution over a small
number of topics, where each topic is a distribution over words. The main aim is
to find the most probable topics and the corresponding word distributions over
those topics. The main algorithm used here is LDA (latent Dirichlet allocation), a
generative model that allows sets of observations to be explained by unobserved
groups that explain why some parts of the data are similar.
Tools Python NLTK, Gensim
Place of
Work
Centre for Artificial Intelligence and Robotics , DRDO, Bangalore
Title Discourse Segmentation using Anaphora Resolution in Malayalam
Name of
Student Lekshmi T S
Abstract A number of linguistic devices are employed in text-based discourse for the
purposes of introducing, defining, refining, and reintroducing discourse entities.
This project looks at one of the most pervasive of these mechanisms, anaphora,
and addresses the question of how discourse is maintained and segmented properly
by resolving anaphora in Malayalam. An anaphor is a linguistic entity which
indicates a referential tie to some other linguistic entity in the same text. The
behaviour of referring expressions throughout the discourse seems to be closely
correlated with segment boundaries. In general, within the limits of a segment,
referring expressions adopt a reduced form such as pronouns, whereas a reference
across a discourse boundary tends to be realized via unreduced forms like
definite descriptions and proper names. We outline an algorithm for anaphora
resolution that works from the output of a Subject-Verb-Object tagger. Next,
by positioning the anaphoric references, segment boundaries are identified. The
focus of this project is on anaphora resolution as an essential prerequisite for
building the discourse segments of a text. An anaphora resolution system can
itself improve the performance of many NLP applications such as text
summarisation, term extraction and text categorisation.
Tools TnT
Place of
Work
IIITM-K, Trivandrum
Title Speaker Verification using i-vectors
Name of
Student
Neethu Johnson
Abstract Speaker verification is the process of verifying the claimed identity of a speaker
based on the speech signal from the speaker. An i-vector is a compact
representation of a speaker utterance. In this modeling, a new low-dimensional
speaker- and channel-dependent space is defined using a simple factor analysis.
This space is named the total variability space because it models both speaker and
channel variabilities. Each speaker utterance is represented by an i-vector in the
total variability space. Channel compensation is carried out in the total variability
space rather than in the GMM supervector space, followed by a scoring technique.
The i-vectors can thus be seen as new speaker recognition features, where the
factor analysis plays the role of feature extractor rather than modeling speaker and
channel effects. The use of the cosine kernel as a decision score for speaker
verification makes the process faster and less complex than other scoring methods.
Tools HTK, LIA-RAL, ALIZE, Matlab
Place of
Work
Amrita Vishwa Vidyapeetham, Coimbatore.
Title Topic Identification Using Fuzzy Graph Based Document Model
Name of
Student
Reshma O.K.
Abstract Fuzzy graphs can be used to model paragraph-paragraph correlation in
documents: nodes of the graph represent paragraphs and the interconnections
represent correlations. This project aims to find the topic from the fuzzy graph
model of a document. Topic identification refers to automatically finding the
topics that are relevant to a document. The task goes beyond keyword extraction,
since relevant topics may not be mentioned explicitly in the document or its title,
and instead have to be obtained from the context or the concept underlying the
document. The proposed system models the document as a fuzzy graph; then,
using eigen analysis of the correlation values from the graph, the most important
paragraph is found. The terms in that paragraph and their synonyms are then
mapped to predefined concepts, and thus the topic is extracted from the document
content. Topic identification can assist search engines in the retrieval of
documents by enhancing the relevancy measures.
Tools NLTK, Stanford CoreNLP, Gensim, Stanford Topic Modelling Toolbox
Place of
Work
Government Engineering College, Sreekrishnapuram
Title Ontology generation from Unstructured Text using Machine Learning
Methods
Name of
Student
Nibeesh K
Abstract This project presents a system for ontology instance extraction that is automatically
trained to recognize ontology instances using statistical evidence from a training
set. This approach has several advantages. First, it eliminates the need for expert
language-specific linguistic knowledge: to train the automated system, users only
need to tag ontology instances, and if the users' needs change, the system can
relearn from new data quickly. Second, system performance can be improved by
increasing the amount of training data without requiring extra expert knowledge.
Third, if new knowledge sources become available, they can easily be integrated
into the system as additional evidence.
Tools GATE, Protege, Jena, OWL2,JAVA
Place of
Work
Government Engineering College, Sreekrishnapuram
Title Text Classification Using String Kernels
Name of
Student
Varsha K V
Abstract Text classification is the task of assigning predefined categories to free text
documents. The documents are represented as feature vectors of high dimension,
where the feature values can be n-grams, named entities, words, etc. Kernel
methods (KMs) make use of kernel functions which give the inner product of the
document feature vectors; they compute the similarity between text documents
without explicitly extracting the features, and are thus considered an effective
alternative to classification based on explicit feature extraction. The project makes
use of string kernels for text classification. A string kernel computes the similarity
between two documents from the substrings they contain; the n-gram kernel and
gappy n-gram kernels are used for the classification. In KMs, learning takes place
in the feature space, where learning algorithms can be applied. The project uses
the Support Vector Machine algorithm; SVMs are a class of algorithms that
combine the principles of statistical learning theory with optimisation techniques
and the idea of a kernel mapping. The non-dependence of KMs on the
dimensionality of the feature space and the flexibility of using any kernel function
make them a good choice for text classification.
Tools Openkernel, OpenFST, Libsvm.
Place of
Work
Amrita Vishwa Vidyapeetham, Ettimadai, Coimbatore.
Title Domain Specific Information Extraction: A Comparison using different Machine
Learning Methods
Name of
Student Prajitha U
Abstract Information extraction is the task of extracting structured information from
unstructured text. It has been widely used in various research areas including NLP
and IR. In the proposed system, different machine learning methods (TnT, SVM,
CRF and TiMBL) are compared on the task of extracting perpetrator entities from
the open-source MUC-3/4 corpus, which contains newswire articles on Latin
American terrorism.
Tools TnT, SVM, CRF, TiMBL
Place of
Work
Government Engineering College, Sreekrishnapuram
Title Eigen Analysis Based Automatic Document summarization
Name of
Student
Sruthimol M P
Abstract Automatic document summarization is the process of reducing a text document
to a summary that retains the main points of the original document. Ranking
sentences according to their importance for the summary is the main task in
summarization. This project proposes an effective approach to document
summarization by sentence ranking, done by vector space modeling and eigen
analysis. The vector space model represents sentences as vectors in an
n-dimensional space with tf-idf weighting. The principal eigenvectors of the
characteristic equation rank the sentences according to their relevance.
Experimental results on the standard DUC 2001 test collection with the ROUGE
evaluation system show that the proposed eigen-analysis-based sentence ranking
improves on conventional tf-idf language model based schemes.
Tools NLTK, Stanford CoreNLP, ROUGE Toolkit
Place of
Work
GEC Sreekrishnapuram
Title Speech Nonspeech Detection
Name of
Student Sincy V. Thambi
Abstract Speech/nonspeech detection identifies pure speech and ignores nonspeech,
which includes music, noise, various environmental sounds, silence, etc. It is
performed by analysing features that distinguish the two efficiently. Various
time domain, frequency domain and cepstral domain features are analysed over
short time frames of 20 ms, together with their mean and standard deviation
over segments of 200 ms. The best features are then selected using various
feature dimensionality reduction mechanisms. An accuracy of 95.085% is
obtained on a 2-hour speech/nonspeech database using a decision tree approach.
Tools Matlab, Weka
Place of
Work
Amrita Vishwa Vidyapeetham, Coimbatore
Title Automatic Information Extraction and Visualization from Defence-related
Knowledge Bases for Effective Entity Linking
Name of
Student Sreejith C
Abstract The project aims to develop an intelligent information extraction system capable
of extracting entities and relationships from natural language reports using
statistical and rule-based approaches. In this project, the knowledge about the
domain is imparted to the machine in the form of an ontology. The extracted
entities are stored in a knowledge base and further visualized as a graph based on
the relationships existing between them, for effective entity linking and inference.
Tools Java, Jena, Protege, CRF++
Place of
Work
Centre for Artificial Intelligence and Robotics , DRDO, Bangalore
Article Invitation for CLEAR - September 2014

We are inviting thought-provoking articles, interesting dialogues and healthy debates on multifaceted
aspects of Computational Linguistics for the forthcoming issue of CLEAR (Computational Linguistics in
Engineering And Research) magazine, to be published in September 2014. The suggested areas of discussion
are:

The articles may be sent to the Editor on or before 10th September, 2014 through the email address
given below.
For more details visit: www.simplegroups.in
Editor, CLEAR Magazine
Representative, SIMPLE Groups
M.Tech Computational Linguistics
Dept. of Computer Science and Engg.
Govt. Engg. College, Sreekrishnapuram, Palakkad
www.simplegroups.in
[email protected]

SIMPLE Groups: Students Innovations in Morphology, Phonology and Language Engineering
Hi,
Computational Linguistics is an emerging and promising discipline
shaping future research and development activities in academia and industry,
in fields ranging from Natural Language Processing and Machine Learning to
Algorithms and Data Mining. Whilst considerable progress has been made in
the development of Artificial Intelligence, Human-Computer Interaction and
related fields, the construction of software that can understand and process
human language still remains a challenging area. New challenges arise in the
modelling of such complex systems, sophisticated algorithms, advanced
scientific and engineering computing and associated problem-solving
environments.

CLEAR is designed to inform readers about the state of the art in a
number of specialized fields related to Computational Linguistics,
highlighting computational methods and techniques for science and
engineering applications. CLEAR is a platform for all, from academic
researchers and professional communities to industry professionals, across a
range of topics in Natural Language Processing and Computational
Linguistics. CLEAR welcomes thought-provoking articles, interesting
dialogues and healthy debates on multifaceted aspects of all areas covered,
including audio, speech, and language processing and the sciences and
technologies that support them.
Thank you for your time and consideration.
Sreejith C
Students' Innovations in Morphology, Phonology and Language
Engineering (abbreviated as SIMPLE) is the official website of M.Tech
Computational Linguistics, Govt. Engineering College, Palakkad. As the name
indicates, SIMPLE is a platform for showcasing our innovations, ideas and
activities in the field of Computational Linguistics. Applications of AI
become much more effective when systems incorporate natural language
understanding capabilities. Here, we are trying to explore how human
language understanding capabilities can be used to model intelligent
behavior in computers. We hope our pursuit of excellence will ultimately benefit
the society, as we believe "Innovation brings changes to the society."

SIMPLE has plans and proposals for active participation in research as
well as in the applications of Computational Linguistics. The association is
interested in organizing seminars, workshops and conferences in this area and
is also looking forward to actively networking with people and organizations in
this field.

Our activities are led by the common philosophy of innovation, sharing and
serving the society. We plan to bring out the association magazine that
explains current technology from a CL perspective.
www.simplegroups.in