CLEAR June 2014 Volume-3 Issue-2
CLEAR Magazine (Computational Linguistics in Engineering And Research)
M.Tech Computational Linguistics, Dept. of Computer Science and Engineering
Govt. Engineering College, Sreekrishnapuram, Palakkad 678633
www.simplegroups.in
[email protected]

Chief Editor: Dr. P. C. Reghu Raj, Professor and Head, Dept. of Computer Science and Engineering, Govt. Engineering College, Sreekrishnapuram, Palakkad

Editors: Sreejith C, Reshma O K, Gopalakrishnan G, Neethu Johnson

Cover page and Layout: Sreejith C
Contents

Editorial 4
SIMPLE News & Updates 5
Language Computing – A New Computing Arena, Elizabeth Sherly 7
Data Completion using Cogent Confabulation, Sushma D K 12
Malayalam POS Tagging Using Conditional Random Fields, Archana T. C, Haritha Lakshmi, Krishnapriya, Sreesha, Jayasree N V 17
Topic Modelling and the LDA Algorithm, Indu M 22
Tamil to Malayalam Transliteration, Kavitha Raju, Sreerekha T V, Vidya P V 26
Memory-Based Language Processing and Machine Translation, Nisha M 29
Dialect Resolution, Manu V. Nair, Sarath K. S 34
M.Tech Computational Linguistics Project Abstracts (2012-2014) 38
CLEAR Call for Articles 46
Last Word 47
Greetings!
This edition of CLEAR is marked by its variety in content and by contributions from eminent scholars such as Prof. Elizabeth Sherly of IIITM-K. It is heartening to note that our efforts are attracting wider attention from the academic community. We seek the readers' suggestions and comments on the previous edition, on Indian Language Computing. As our effort is to bring language technology into mainstream academics, we plan to include articles on NLP/CL tools and platforms for specialized tasks, and on the latest computing paradigms like MapReduce and MBLP and their relevance to language technology. So, keep reading!
With Best Wishes,
Dr. P. C. Reghu Raj
(Chief Editor)
NEWS & UPDATES

Publications
"Text Based Language Identification System for Indian Languages Following Devanagiri Script", Indhuja K, Indu M, Sreejith C, Dr. Reghu Raj P C, International Journal of Engineering Research & Technology (IJERT), Volume 3, Issue 04, April 2014, ISSN: 2278-0181

"Eigenvector Based Approach for Sentence Ranking in News Summarization", Divya S, Reghuraj P C, IJCLNLP, April 2014

"A Natural Language Question Answering System in Malayalam Using Domain Dependent Document Collection as Repository", Pragisha K, P C Reghu Raj, IJCLNLP, April 2014

"Box Item Generation from News Articles Based on Paragraph Ranking using Vector Space Model", Sreejith C, Sruthimol M P, P C Reghuraj, International Journal of Scientific Research in Computer Science Applications and Management Studies (IJSRCSAMS), Volume 3, Issue 2, March 2014, ISSN 2319-1953
SIMPLE Groups congratulates all the authors on their achievements!
Industrial Training at IIITM-K
The Virtual Resource Centre for Language Computing (VRC-LC) department of the Indian Institute of Information Technology and Management - Kerala (IIITM-K) organized a short course and industrial training on Natural Language Processing exclusively for the M.Tech students of GEC, Sreekrishnapuram.

The course focused on recent trends in Computational Linguistics and Machine Learning related to Malayalam Computing. It was a 15-day programme (5th May to 20th May). During the course, research scholars and eminent faculty of VRC-LC delivered sessions on various aspects of Language Processing. Discussion of the Malayalam computing work in progress at the centre gave the participants a clear idea of the workings of, and the challenges involved in, Malayalam Computing.
Language Computing: A New Computing
Arena
Elizabeth Sherly,
Professor,
IIITM-Kerala
The great strides made in Information Technology in every walk of life have slowly given prominence to Language Computing, which has immense importance in today's computing world. A country like India, which values its culture, heritage and languages, is greatly in need of local language support so as to counter the dominance of English in computing. India has about 25 years of history in language computing (LC), which has gone through its ups and downs. But during the last decade there has been a paradigm shift, with a significant leap for LC as a new computing arena. Scientists who were once reluctant to take up language computing for research came to choose language technology (LT) as a mainstream research area, and, interestingly, industry giants like Microsoft and Google entered Language Computing in a big way. A phenomenal shift in LT has thus taken place, as Language Technology tools have become indispensable for enhancing products and services in high-growth markets such as mobile applications, healthcare, IT services, financial services, online retail, call centers, publishing and media.
Thanks are due to Noam Chomsky, a mathematician turned linguist, who expressed the structure of human language in mathematically viable symbols by introducing the concept of the TREE diagram. Research in computational linguistics then gained good momentum, because there was now a way to represent linguistics in mathematical and logical form, enabling a piece of text to be converted into a programmer-friendly data structure. As natural language involves human understanding and processing, Artificial Intelligence also plays a significant role in the development of Computational Linguistics models. The terms Language Computing (LC) and Computational Linguistics (CL) are sometimes used interchangeably; both are key terminologies derived from Natural Language Processing (NLP).
Some of the major applications in LC are Machine Translation, Information Retrieval, Automatic Summarization, Question Answering, Automatic Speech Recognition, Language Writing and Spoken Aids, Dialog Systems, Man-Machine Interfaces, Knowledge Representation etc. Machine Translation (MT) is one of the major tasks; research on it goes back to the 1950s, but it still remains an open problem. There is no 100% accurate Machine Translation system for any pair of languages in the world; the day we achieve that, most of the other problems in LT will be resolved. The major task involved in Machine Translation systems is to describe the language in terms of syntactic and semantic information. The syntactic information is generated using Morphological Analysis, Parts-of-Speech (POS) tagging, Chunking, and Parsing. The semantic information is obtained using Word-Sense Disambiguation (WSD), Semantic Role Labelling (SRL), Named Entity Recognition (NER), and Anaphora Resolution (AR). The primary research in CL is basically to develop models and tools for the above-mentioned components. The challenge in this work is that each language has a diverse linguistic nature, with varied morphological features, inflectional sets, grammatical structure etc., which makes language computing research more complex. The limited availability of large corpora and dictionaries in each language is another constraint for researchers in LC.
In India, there has been a phenomenal shift in Language Technology and Computational Linguistics research and development over the last several years, as Language Technology tools have become inevitable in many applications. The Department of Electronics and Information Technology (DEITY), India, initiated Technology Development for Indian Languages (TDIL) (http://tdil.mit.gov.in) with the objective of developing information processing tools and techniques to facilitate human-machine interaction without a language barrier, creating and accessing multilingual knowledge resources, and integrating them to develop innovative user products and services. The major activities of TDIL involve research and development in Language Technology, including machine translation, multilingual parallel corpora, cross-lingual information access, optical character recognition and text-to-speech conversion, and the development of standards related to Language Technology. Various projects and research efforts have been ongoing under TDIL through the work of a number of scientific organizations and educational institutions. The IITs, IIIT-Hyderabad, IIITM-Kerala, CIIL-Mysore, Hyderabad Central University, the Centre for Computer Science and the Sanskrit Centre at JNU, New Delhi, and CDAC are the main centres where computational linguistics is taught and researched.
Apart from Machine Translation, the research directions in Language Computing are towards Automatic Speech Recognition, Speech-to-Text Processing, Web and Semantics, Sentiment Analysis, Named Entity Recognition (NER), Anaphora Resolution, Word-Sense Disambiguation etc. It is not too futuristic to believe that we could be talking to computers that act according to our commands. New techniques and models have to be explored for better results. Ontology and Semantics and psycholinguistic analysis are other upcoming areas whose research aims at getting more sensible information from the web and other applications. The mobile phone, being one of the handiest and most widely used gadgets, needs local language support for its various features; this is yet another potential area for research. Since the Internet of Things (IoT) is closely associated with Machine-to-Machine, Machine-to-Human and Machine-to-small-device communication, it needs local language enablement to provide information to the local masses.
The opportunities in industry are also very promising for CL. One look at the 2013 Gartner predictions tells us that there is a huge demand for computational language scientists. Natural Language Processing and Speech Recognition are no doubt prominent among the most anticipated technologies of the near future. Industries that employ computational linguists are forging ahead tremendously, because currently most web, mobile and social media based work needs extensive language support. The industry looks mainly for three roles in Language Computing: linguists, computer programmers (both having knowledge of language computing) and researchers. Researchers in both Linguistics and Computer Science also have good opportunities in the industry; Microsoft, Google, IBM, HP and several other companies working heavily in this field need trained manpower in LT. The global market value of 19.3 billion in 2011 has been predicted to shoot to 30 billion in 2015. The major thrust areas are Machine Translation, to create systems and technologies catering to today's multitude of translation scenarios; multilingual systems, to develop a natural-language-neutral approach in all aspects of linguistic computing; and natural language processing and Automatic Speech Recognition, to design and build software which can analyze, understand, and generate natural human languages, so as to make addressing a computer like addressing a human being.
Work on Malayalam Computing has been active for the last decade, with the aim of enabling computers to understand and process Malayalam. The major work in Malayalam Computing involves Machine Translation systems from Malayalam to other languages and vice versa, spell checkers, Malayalam search engines, Malayalam text-to-speech and speech-to-text systems, morphological analysers and POS taggers for Malayalam, Malayalam text editors, human interaction interfaces, Malayalam language tutors, corpora building, dictionary building etc. CDAC, CDIT, IIITM-K, AU-KBC (Chennai), CIIL-Mysore, IIIT-Hyderabad and Amrita are some of the major institutions actively pursuing research and product development in Malayalam Computing. There are also certain NGOs and communities actively engaged in work related to Malayalam Computing.
Despite all the efforts made by many contributors, awareness among the masses is very low. A mechanism for encouraging and promoting Indian Language Computing is the need of the hour. There should be flagship programmes and movements in the state, in academic institutions and in other organizations to give more visibility and accessibility, ensuring reach into every corner of the nation. Language Computing and its use, support, and standardization have to be well placed in India's IT policy, at both the national and state levels. Each language group should actively promote its language and culture, not only among the people of its own state, but wherever that language group exists worldwide. Taking Malayalam as an example, some of the promotional activities that can be done at the state and college levels are listed below.

• Include the importance of Malayalam Computing in Kerala's IT Policy.
• Create an Internet mailing list, say Malayalam.Net, and introduce the various tools and products developed through it.

• Form a Malayalam Computing Task Force; promote and support various activities and form PMUs for implementation.

• Establish a non-profit, Government/non-Government body to promote Malayalam Computing.

• Promote periodical workshops, seminars and conferences on Malayalam Computing in educational institutions.

• Make Language Computing mandatory for Computer Science courses at the university level.

• Promote Malayalam Computing using social media.

• Publish articles, research papers, tools and products, and news on LC in magazines, journals and notice boards.
Due to the inherent structure of Malayalam as a highly agglutinative and inflectional language, the issues in Malayalam Computing are many compared to other languages. The rendering issues in the display of Malayalam fonts require greater attention for a better appearance. The computational models for various tasks, namely morphological analysis, POS tagging etc., are to be refined for more accurate results. Most of the models are rule-based, statistical or hybrid models like Support Vector Machines, HMM and TnT, which show 95-100% accuracy. The Machine Translation systems, however, have shown an accuracy of only 60-65%. Deep Learning has recently shown much promise for NLP applications. The deep neural network, with its capability to learn distributed representations based on the similarity of words and its ability to learn multiple levels of representation, is encouraging for cognitive problem solving such as natural language understanding and Speech Recognition. The day when we use computers with Malayalam input devices and interfaces, with Malayalam processing and understanding handled the way humans handle it, is not very far.
Data Completion using Cogent
Confabulation
Sushma D K
M.Tech
Dept. of Computer Science
SJBIT,
Bangalore
Introduction
Information is being generated on a daily basis and continuously uploaded onto the web in massive quantities. This information is of many types, ranging from simple text files to video files, and it may sometimes be incomplete due to various reasons: variations in electrical signals, intentional causes, unknowingly overwriting the data, etc. Cogent confabulation is a unique and novel technique for completing missing data. The cogent confabulation problem consists of two major tasks: (1) training on the available data to generate knowledge bases (matrices storing the appearance counts of words in the training corpus); (2) querying over this trained data to predict the next word or phrase, given the starting few words of the phrase or sentence.
To overcome the difficulty of completing missing information, the previously available data has first to be studied extensively. This data can then be trained at the semantic level using a particular model and used in the completion of the missing data. Cogent confabulation is a new model of vertebrate cognition used for training and querying the available data. Confabulation is the process of selecting the one symbol (termed the conclusion of the confabulation) whose representing words happen to be receiving the highest level of excitation (appearance count). The confabulation product is the product of the probabilities of occurrence of the assumed words individually with the target word. Cogency is defined as the probability of occurrence of all the assumed words together with the target word.
To predict and fill the missing data, the procedure starts by maximizing the cogency p(αβγδ|ε) and the confabulation product p(α|ε)·p(β|ε)·p(γ|ε)·p(δ|ε). The order in which the corpus words are chosen for training is meant to reflect the pattern of text contained in the expository text. The approach uses lexical analysis based on the appearance counts of words, an information retrieval measurement, to determine how likely a particular word is to be the conclusion word. Phrase completion and sentence continuation using the confabulation model should be useful for many text analysis tasks, including information retrieval and summarization, generating artificial stories etc.
Confabulation
Confabulation is a new model of vertebrate cognition which mimics Hebbian learning and the information processing procedures of the human brain. Cognitive information processing is a direct evolutionary re-application of the neural circuits controlling movement, and thus functions just like movement. Conceptually, brains are composed of many muscles of thought (termed thalamocortical modules in mammals). A module contains symbols, each of which is a sparse collection of neurons that functions as a descriptor of the attribute of that module. For example, if the attribute of a module is color, then a single symbol represents a particular color. Each thalamocortical module is connected to many other modules. When two symbols are active simultaneously, they are said to co-occur, which creates the opportunity to associate the two symbols. For instance, after seeing a face and hearing a name together, the symbols representing each may become associated. Each strengthened unidirectional association between two symbols is termed a knowledge link; collectively, knowledge links comprise all cognitive knowledge. Each thalamocortical module performs the same single information processing operation, which can be thought of as a contraction of a list of symbols, termed a confabulation. Throughout a confabulation, input excitation is delivered to the module through knowledge links from active symbols in other modules, driving the activation of these knowledge links' target symbols (the candidate conclusion symbols) in the module performing the confabulation. As the list of conclusion symbols contracts there is no physical movement in the brain; rather, the symbols currently on the list compete (based upon their relative excitation levels) for eventual exclusive activation (a so-called winner-take-all competition) within that module, and, as a result, the number of active symbols is gradually reduced.
Crucially, this contraction of the
candidate conclusion symbol list in each
thalamocortical module is externally
controlled by a thought control signal
delivered to the module. A confabulation in a
thalamocortical module is controlled by a
graded analog control input. The thought
control signal determines how many symbols
remain in the competition, but has no effect
on selecting which symbols are in the
competition. Which symbols are in the
competition is determined by the excitation
level of a symbol as it dynamically reacts to
knowledge link input from active symbols in
other modules (which cause its excitation
level to increase) or to a reduction or
cessation of such input (which causes its
excitation level to fall). Ultimately, the
thought control signal is used to dynamically
contract the number of active symbols in a
module from an initial many less-active
symbols to, at the end of the confabulation, a
single maximally-active symbol. The
resulting single active symbol is termed the
confabulation conclusion. The learned
association between each symbol of a
module and its set of action commands is
termed skill knowledge. Skill knowledge is
stored in the module, but the learning of
these associations is controlled by
subcortical brain nuclei. When a conclusion
is reached in a module, those action
commands which have a learned association
from that conclusion symbol are instantly
launched. These issued action commands are
proposed as the source of all non-reflexive
and non-autonomic behaviors.
Thalamocortical modules performing
confabulations, delivering excitation through
knowledge links, and applying skill
knowledge through the issuance of action
commands constitute the complete
foundation of all mammalian cognition.
Methodology and applications
One of the methodologies used for data completion is confabulation, which was explained above. The other key concept is cogency: the probability of occurrence of all the assumed fact words together with the assumed target word, i.e. p(αβγδ|ε).
A. Terminologies:
According to the confabulation model, assuming that the combined assumed facts αβγδ are true, the set of all symbols (in the answer lexicon from which conclusions are being sought) with p(αβγδ|ε) > 0 is called the expectation; its elements are termed answers, considered in descending order of their cogencies. Confabulation maximizes the product p(α|ε)·p(β|ε)·p(γ|ε)·p(δ|ε) (or, equivalently, the sum of the logarithms of these probabilities) as a surrogate for maximizing cogency. It is assumed that all required pairwise conditional probabilities p(Ψ|λ) between symbols Ψ and λ are known; this assumption is termed exhaustive knowledge. Each meaningful non-zero p(Ψ|λ) is termed an individual item of knowledge.
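To make this concrete, here is a minimal sketch (with hypothetical helper names, not the paper's implementation) of the two tasks described earlier: training a knowledge base of appearance counts from a toy corpus, and confabulating a conclusion word by maximizing the sum of log conditional probabilities over the context words:

```python
from collections import defaultdict
from math import log

def train(sentences, window=4):
    # Knowledge base: co-occurrence counts of context words with each
    # candidate target word, plus total counts per target.
    pair_counts = defaultdict(lambda: defaultdict(int))
    target_counts = defaultdict(int)
    for words in sentences:
        for i, target in enumerate(words):
            target_counts[target] += 1
            for ctx in words[max(0, i - window):i]:
                pair_counts[target][ctx] += 1
    return pair_counts, target_counts

def predict(context, pair_counts, target_counts):
    # Confabulation: choose the candidate maximizing the summed
    # log p(context word | candidate), a surrogate for cogency.
    best, best_score = None, float("-inf")
    for cand, ctx_counts in pair_counts.items():
        score = 0.0
        for ctx in context:
            if ctx_counts.get(ctx, 0) == 0:
                break  # candidate lacks this item of knowledge
            score += log(ctx_counts[ctx] / target_counts[cand])
        else:
            if score > best_score:
                best, best_score = cand, score
    return best

corpus = [["the", "cat", "sat", "on", "the", "mat"],
          ["the", "dog", "sat", "on", "the", "rug"]]
pairs, totals = train(corpus)
print(predict(["sat", "on", "the"], pairs, totals))  # -> "mat"
```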
B. Applications
These basic concepts of cognition and confabulation are applied in the artificial intelligence and machine learning fields to predict, fill in, and complete missing data in sentences. They can also be used for context representation, context exploitation, intelligent text recognition, and generating artificial stories.
Conclusion and Future Works
This paper is an initiative towards better data completion and manipulation. The major requirement for the process is a previously complete text corpus. Previous works fill in missing data based on Bayesian models, where words are selected by maximizing their posterior probability (selecting the conclusion which has the highest probability of being correct). Though Bayesian theory has been used extensively in neural networks, it is an awful model for cognition and data completion, because it simply selects the word with the highest probability value, even if that word is irrelevant to the given context. Many open source tools can be used for easy development. These basic concepts of cognition and confabulation are applied for context representation, context exploitation, intelligent text recognition, and generating artificial stories. The longer execution times due to lexicon overhead can be reduced by parallel processing. Applications that demand high throughput will have to evaluate the proposed confabulation method depending on the hardware available.
Computer chatbot 'Eugene Goostman' passes the Turing test
Eugene Goostman is a chatterbot, first developed by a group of three programmers (the Russian-born Vladimir Veselov, Ukrainian-born Eugene Demchenko, and Russian-born Sergey Ulasen) in Saint Petersburg in 2001. Goostman is portrayed as a 13-year-old Ukrainian boy, a trait intended to induce forgiveness in users for his grammar and level of knowledge. The Goostman bot has competed in a number of Turing test contests since its creation, and finished second in the 2005 and 2008 Loebner Prize contests. In June 2012, at an event marking what would have been the 100th birthday of its namesake, Alan Turing, Goostman won what was promoted as the largest-ever Turing test contest, successfully convincing 29% of its judges that it was human. On 7 June 2014, at a contest marking the 60th anniversary of Turing's death, 33% of the event's judges thought that Goostman was human; the event's organizer Kevin Warwick considered it to have "passed" Turing's test as a result, per Turing's prediction that by the year 2000, machines would be capable of fooling 30% of human judges after five minutes of questioning.
Malayalam POS Tagging Using Conditional
Random Fields
Archana T. C, Haritha Lakshmi, Krishnapriya, Sreesha, Jayasree N V
Dept. of Computer Science and Engg,
Sreepathy Institute of Management and Technology, Vavanoor
Abstract: Parts of speech tagging is the process of assigning tags to the words of a given sentence. This paper presents the building of a Part-Of-Speech (POS) tagger for the Malayalam language using Conditional Random Fields (CRF). A POS tagger plays an important role in natural language applications like speech recognition, natural language parsing, information retrieval and information extraction. The present tagset consists of 100 tags. The system includes a language model, trained on an annotated corpus of 3026 sentences (36,315 words), which checks the trigram probability of occurrence of tags in the training corpus. We present a trigram HMM-based (Hidden Markov Model) part-of-speech tagger for the Malayalam language, which accepts raw text and produces POS-tagged output. The accuracy of the system can be improved by increasing the size of the annotated corpus. Although the experiments were performed on a small corpus, the results show that the statistical approach works well with a highly agglutinative language like Malayalam.

I. Introduction
India is a large multilingual country of diverse culture. It has many languages with written forms and over a thousand spoken languages. The Constitution of India recognizes 22 languages, spoken in different parts of the country. The languages can be categorized into two major linguistic families, namely Indo-Aryan and Dravidian. These classes of languages have some important differences: their ways of forming words and their grammars differ. But both include a lot of Sanskrit words, and both have a similar construction and phraseology that links them together. There is a need to develop information processing tools to facilitate human-machine interaction in Indian languages, and multilingual knowledge resources. A POS tagger forms an integral part of any such processing tool. Parts of Speech tagging, a form of grammatical tagging, is the process of marking the words in a text as corresponding to a particular part of speech, based on their definition and context, i.e. their relationship with adjacent and related words in a phrase, sentence, or paragraph. This is the first step towards understanding any language. It finds its major applications in speech and
NLP, such as Speech Recognition, Speech Synthesis, Information Retrieval etc. A lot of work relating to this has been done in the NLP field, particularly in part-of-speech tagging of Western languages. These taggers vary in accuracy and also in their implementation. A lot of techniques have also been explored to make tagging more and more accurate, ranging from purely rule-based approaches to completely stochastic ones. Some of these taggers achieve good accuracy for certain languages. But unfortunately, not much work has been done with regard to Indian languages, especially Malayalam. In this paper we have developed a POS tagger based on Conditional Random Fields (CRF). Conditional Random Fields, undirected graphical models, are used to calculate the conditional probability of values on designated output nodes given values on other designated input nodes. It can be seen that the generative model (HMM) performs quite close to CRF. A Hidden Markov Model (HMM) is a statistical model in which the system modeled is assumed to be a Markov process with unknown parameters. The assumptions on which it works are that the probability of a word in a sequence may depend on the word immediately preceding it, and that both the observed and hidden words must form a sequence. It has been shown experimentally that the accuracy of the POS tagger can be improved significantly by introducing a trigram template, an efficient corpus and a widely accepted tagset. This paper mainly concentrates on designing a POS tagger using CRF++, an open source implementation of CRF.
System Architecture
The system consists of 3 modules
namely Preprocessing, Training and Testing.
The architecture of the proposed system is
depicted in Fig.1.
Fig 1. System Architecture
Preprocessing is the initial stage in the implementation of a Malayalam POS tagger. It takes the input sentence and tokenizes it: it receives the input as a sentence and splits it into words. The split words are stored in a file, which becomes the input for the testing phase.
The proposed method uses a supervised mode of learning for POS tagging. The simplest statistical approach finds the most frequently used tag for a specific word in the annotated training data and uses this information to tag that word in unannotated text; this is done during the training phase. Since we use a statistical approach, i.e. we train and test our model, we have to calculate the frequency and probability of the words of the given corpus.

During training, the annotated corpus is trained using the CRF template. All the files required for the procedure are explained with examples below. For training, the crf_learn command is used as shown below:
crf_learn template_file train_file model_file
The template file specifies the positions of the words used as features for tagging. If a unigram template is used, a word is tagged based on the current word alone. If bigram statistics are used, the tagging is done based on the current and previous words. On increasing the number of grams, the template becomes more expressive and the tagging more effective. The template file can be represented as shown in Fig. 2.
Fig. 2 CRF Template
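For illustration, a minimal template of the kind Fig. 2 depicts might look as follows; this is a generic sketch, not necessarily the authors' exact feature set. In CRF++ template files, `U` lines declare unigram feature templates, `%x[row,col]` refers to the token at a relative row offset, and a lone `B` adds bigram features over adjacent output tags:

```
U00:%x[-1,0]
U01:%x[0,0]
U02:%x[1,0]
U03:%x[-1,0]/%x[0,0]
B
```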
The tagset used in the system was developed by the Bureau of Indian Standards. A tagset is a list of short forms representing the components of a sentence, like nouns, verbs and their sub-forms. The corpus training is performed with the help of the tagset, which in this system contains 100 tags. Training creates a model file as output, which contains the learned probabilities of the corpus.
![Page 20: Clear june 2014](https://reader034.fdocuments.net/reader034/viewer/2022052305/568c38451a28ab02359e5ef6/html5/thumbnails/20.jpg)
CLEAR June 2014 20
Fig.3 Training Corpus
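CRF++ expects the training corpus as one token per line, with whitespace-separated columns, the tag in the last column, and blank lines separating sentences. A schematic sketch of the format (English placeholders standing in for Malayalam tokens, with BIS-style tags):

```
John      N_NNP
reads     V_VM
books     N_NN

She       PR_PRP
writes    V_VM
```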
Testing is the process of comparing the trained model file with the tokenized input file and, obeying the rules in the template and the tags in the tagset, finding the corresponding tag of each and every word. The crf_test command is used:

crf_test -m model_file test_file
A screenshot of the proposed system
is given in Fig.4.
Fig 4: Screenshot
Conclusion
The motivation of this project is to help children and foreigners learn the structure of a Malayalam sentence. The application is built with the Python programming language as the development environment, keeping in mind design standards and the maintainability of the code. CRF++, together with the Hidden Markov Model and Tkinter features, provides a rich user experience to the users of the software. This application is very simple to use and is helpful to people who are in the preliminary stages of learning the Malayalam language.
Making the world’s knowledge computable
Wolfram Alpha introduces a fundamentally new way to get knowledge and answers: not by searching the web, but by doing dynamic computations based on a vast collection of built-in data, algorithms, and methods.
http://www.wolframalpha.com/
Topic modelling and LDA algorithm
Indu M
Dept. of Computer Science, Govt. Engineering College,
Sreekrishnapuram
Introduction
As more and more electronic documents become available, it becomes harder to find and discover what we need. We need new tools to help us organize, search, and understand these vast amounts of information.
Topic models are based upon the idea that documents are mixtures of topics, where a topic is a probability distribution over words. A topic model is a generative model for documents: it specifies a simple probabilistic procedure by which documents can be generated. To make a new document, one chooses a distribution over topics. Then, for each word in that document, one chooses a topic at random according to this distribution, and draws a word from that topic.
Probabilistic topic models are a suite of algorithms whose aim is to discover the hidden thematic structure in large archives of documents. One of the simplest topic models is latent Dirichlet allocation (LDA). The intuition behind LDA is that documents exhibit multiple topics. Most topic models, such as LDA [1], are unsupervised: only the words in the documents are modeled, and the goal is to infer topics that maximize the likelihood (or the posterior probability) of the collection.
The main application of topic modeling is in Information Extraction. It can also be used to analyze, summarize, and categorize streams of text data as they arrive. For example, as news arrives in streams, organizing it into threads of relevant articles is more efficient and convenient.
A topic model automatically captures the thematic patterns in a text stream, identifies emerging topics, and tracks their changes over time. It can be used to check models, summarize the corpus, and guide exploration of its contents. Topic modelling can also enhance information network construction by grouping similar objects, event types and roles together.

Topic modelling programs do not know anything about the meaning of the words in a text. Instead, they assume that any piece of text is composed (by an author) by
selecting words from possible baskets of
words where each basket corresponds to a
topic. If that is true, then it becomes possible
to mathematically decompose a text into the
probable baskets from whence the words
first came.
To make a new document in topic modelling, one chooses a distribution over topics; then, for each word in that document, one chooses a topic at random according to this distribution and draws a word from that topic. The model specifies the following distribution over the words within a document:

p(w) = Σ_{z=1..T} p(w|z) p(z)

where T is the number of topics, p(z) is the distribution over topics z in a particular document, and p(w|z) is the probability distribution over words w given topic z.
LDA
Let us look at the basic ideas behind latent Dirichlet allocation (LDA), the simplest topic model [3]. The intuition behind LDA is that documents exhibit multiple topics. For example, consider the article in Figure 1 [1]. This article, entitled "Seeking Life's Bare (Genetic) Necessities", is about using data analysis to determine the number of genes that an organism needs to survive.

The figure highlights the different words that are used in the article. Words about data analysis, such as "computer" and "prediction", are highlighted in blue; words about evolutionary biology, such as "life" and "organism", are highlighted in pink; words about genetics, such as "sequenced" and "genes", are highlighted in yellow. At the left of the figure are a number of topics. Each document is assumed to be generated as follows: first choose a distribution over the topics (the histogram at right); then, for each word, choose a topic assignment (the coloured coins) and choose the word from the corresponding topic.

LDA models can be used to find topics that describe a corpus, where each document exhibits multiple topics.
Figure: Graphical model of LDA
Methodology and applications
One of the methodologies used for topic detection is LDA, explained above. Theoretical studies of topic modelling focus on learning the model's parameters assuming the data is actually generated from it. Existing approaches for the most part rely on Singular Value Decomposition (SVD), and consequently have one of two limitations: they need either to assume that each document contains only one topic, or else they can only recover the span of the topic vectors instead of the topic vectors themselves.

Other probabilistic models, such as naive-Bayes Dirichlet Compound Multinomial (DCM) mixtures and probabilistic Latent Semantic Indexing (pLSI), have also been used to find relevant topics. We can also simplify the topic distribution by modelling each topic as a discrete probability distribution over documents.
A. Algorithm

The corpus contains a collection of documents. For each document in the collection, we generate the words in a two-stage process:

1. Randomly choose a distribution over topics.
2. For each word in the document:
   (a) Randomly choose a topic from the distribution over topics in step #1.
   (b) Randomly choose a word from the corresponding distribution over the vocabulary.

A sketch of this generative process appears below.
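As an illustration (not from the article), a minimal sketch of the two-stage generative process, with hypothetical topic-word distributions over a toy vocabulary:

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["gene", "dna", "computer", "data", "life", "organism"]
# Hypothetical per-topic word distributions (each row sums to 1).
topics = np.array([
    [0.4, 0.4, 0.0, 0.0, 0.1, 0.1],    # "genetics"
    [0.0, 0.1, 0.45, 0.45, 0.0, 0.0],  # "data analysis"
    [0.1, 0.0, 0.0, 0.0, 0.45, 0.45],  # "evolutionary biology"
])

def generate_document(n_words, alpha=0.5):
    # Step 1: draw the document's distribution over topics
    # (a Dirichlet draw, as in LDA).
    theta = rng.dirichlet([alpha] * len(topics))
    words = []
    for _ in range(n_words):
        z = rng.choice(len(topics), p=theta)     # step 2(a): pick a topic
        w = rng.choice(len(vocab), p=topics[z])  # step 2(b): pick a word
        words.append(vocab[w])
    return theta, words

theta, doc = generate_document(10)
print(np.round(theta, 2), doc)
```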
This statistical model reflects the intuition that documents exhibit multiple topics. Each document exhibits the topics in different proportions (step #1), and each word in each document is drawn from one of the topics (step #2b), where the selected topic is chosen from the per-document distribution.

In order to evaluate the predictive power of a generative model on unseen data, there is a standard measure known as perplexity. With it, one is even able to relate words from different languages.
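As a sketch of the standard definition (not from the article), perplexity is the exponentiated negative average log-likelihood of the held-out words; lower is better:

```python
import math

def perplexity(log_probs):
    # log_probs: natural-log probabilities the model assigns to each
    # held-out word.
    return math.exp(-sum(log_probs) / len(log_probs))

print(perplexity([math.log(0.1)] * 50))  # uniform 0.1 per word -> 10.0
```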
B. Applications
Topic models are good for data exploration, when there is some new data set and you don't know what kinds of structures you could possibly find in there. If you do know what structures you could find in your data set, topic models are still useful when you don't have the time or resources to construct classification models based on supervised machine learning. Lastly, even if you do have the time and resources to construct classification models based on supervised learning, topic models are still useful as extra features to add to the models in order to increase their accuracy. This is the case because topic models act as a kind of "smoothing" that helps combat the sparse data problem often seen in supervised learning.
Topic modeling can be used in computer vision; as an inference algorithm applied to natural texts in the service of text retrieval, classification, organization, and building text hierarchies; in WSD and machine learning; to organize, summarize, and help users explore large corpora; in information engineering applications; and in scientific applications such as genetics and neuroscience.
Conclusion and Future Works
Documents are partitioned into topics, which in turn have terms associated with them to varying degrees. In practice, however, there are some clear issues: the models are very sensitive to the input data, and small changes to the stemming/tokenization algorithm can result in completely different topics; topics need to be manually categorized in order to be useful; and topics are "unstable", in the sense that adding new documents can cause significant changes to the topic distribution.
REFERENCES
[1] D. Blei, "Introduction to Probabilistic Topic Models", Communications of the ACM, pp. 77-84, 2012.

[2] Vivi Nastase, "Introduction to Topic Models: Building up towards LDA", summer semester 2012.

[3] D. Blei et al., "Supervised Topic Models", Princeton University.

[4] David Hall et al., "Studying the History of Ideas Using Topic Models", Stanford University.
Tamil to Malayalam Transliteration
Kavitha Raju, Sreerekha T V, Vidya P V
M.Tech CL
GEC, Sreekrishnapuram
Palakkad
Abstract: Transliteration can form an essential part of transcription, which converts text from one writing system to another. This article discusses the applications of and challenges in machine transliteration from Tamil to Malayalam, two languages that belong to the Dravidian family. Transliteration can be used to supplement the machine translation process by handling issues that can arise due to the presence of named entities.

I. Introduction
The rewriting or conversion of the characters of a text from one writing system to another is called transliteration. Here each character of the source language is assigned a different, unique character of the target language, so that an exact inversion is possible. If the source language has more characters than the target language, combinations of characters and diacritics can be used.
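As a toy illustration of such a character-level mapping (not the article's system): the Unicode blocks for Tamil (U+0B80-U+0BFF) and Malayalam (U+0D00-U+0D7F) share a common ISCII-derived layout, so a naive transliterator can shift code points by a fixed offset. A real system must additionally handle the one-to-many mappings discussed below.

```python
# Naive code-point-shift sketch; assumes the shared ISCII-derived layout
# of Unicode Indic blocks and ignores one-to-many correspondences.
TAMIL_START, TAMIL_END = 0x0B80, 0x0BFF
OFFSET = 0x0D00 - 0x0B80  # distance between the Tamil and Malayalam blocks

def tamil_to_malayalam(text: str) -> str:
    out = []
    for ch in text:
        cp = ord(ch)
        if TAMIL_START <= cp <= TAMIL_END:
            out.append(chr(cp + OFFSET))  # shift into the Malayalam block
        else:
            out.append(ch)                # leave other characters alone
    return "".join(out)

print(tamil_to_malayalam("தமிழ்"))  # -> "തമിഴ്"
```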
Machine transliteration systems have great importance in a country like India, which has a fascinating diversity of languages. Even though there are groups of languages that come from a common origin, the difference in scripts makes the task cumbersome.

Machine transliteration systems can be classified into rule-based and corpus-based approaches, in terms of their core methodology. Rule-based systems develop linguistic rules that allow the words to be put in different places, to have different scripts depending on context, etc. The main approach of these systems is based on linking the structure of the given input with the structure of the demanded output, necessarily preserving their unique meaning. Minimally, to get a transliteration of a source language sentence, one needs a dictionary that will map each source language word to an appropriate target language word, rules representing source language word structure, and rules representing target language word structure. Because rule-based machine transliteration uses linguistic information to mathematically break down the source and target languages, it is more predictable and grammatically superior to the statistical method. Statistical machine transliteration employs a statistical model, based on the analysis of a corpus, to generate text in the target language. The idea behind statistical machine transliteration comes from information theory: a document is transliterated according to the probability distribution p(e|f) that a string e in the target language is the transliteration of a string f in the source language.
There are two primary motivations behind pursuing this project. The first is that in India, the majority of the population still uses their mother tongue as the medium of communication; the second is that, in spite of globalization and the widespread influence of the West in India, most people still prefer to read, write and talk in their mother tongue.
II. Language Features
Malayalam is a language spoken in India,
predominantly in the state of Kerala. It is one
of the 22 scheduled languages of India and
was designated a Classical Language in India
in 2013. Malayalam has official language
status in the state of Kerala and in the union
territories of Lakshadweep and Puducherry.
It belongs to the Dravidian family of
languages. The origin of Malayalam,
whether it was from a dialect of Tamil or an
independent offshoot of the Proto Dravidian
language, has been and continues to be an
engaging pursuit among comparative
historical linguists.
Tamil is a Dravidian language spoken
predominantly by Tamil people of South
India and North-east Sri Lanka. It has official
status in the Indian states of Tamil Nadu,
Puducherry and Andaman and Nicobar
Islands. Tamil is also an official language of
Sri Lanka and an official language of
Singapore. It is recognised as one of the languages of the medium of education in Malaysia, along with English, Malay and Mandarin. It is also spoken as a secondary language in the states of Kerala, Karnataka and Andhra Pradesh and in the Andaman and Nicobar Islands. It is one of the 22
scheduled languages of India and was the
first Indian language to be declared a
classical language by the Government of
India in 2004.
Both Tamil and Malayalam are languages of the southern states of India. Even though Malayalam is said to belong to the Dravidian family of languages, it is more similar to the Arya languages. When the Aryans came to the north-east border of Bharatam, the people of Harappa and Mohenjo-daro moved east and south and replanted their civilisation in South India, based in Tamil Nadu. Tamil is one of the oldest languages. When the Dravidians moved south, there were people of the soil to receive and help them, and the geography helped them to preserve Tamil. The Dravidians settled in Tamil Nadu and developed Tamil literature and Tamil civilization.
In comparison to other Indian languages, Tamil has only 12 vowels and 18 consonants. Malayalam is one of the most updated languages, with clarity in voice, and is comparatively more difficult to study: it has 56 letters, and its vowels most closely resemble those of Sanskrit. Since Tamil has only a few alphabets in comparison to Malayalam, it is not possible to have a one-to-one mapping between these languages; a letter in Tamil can be transliterated as more than one letter in Malayalam.
III. Applications and Challenges
The various applications of machine
transliteration system are as follows:
• It aids machine translation.
• It helps to eliminate language barriers.
• It supports localization.
• It enables the use of a keyboard in a
given script to type in a text in another one.
Transliteration also has many challenges.
They are:
• Dissimilar alphabet sets of the source
and target languages.
• Multiple possible transliterations for
the same word.
• Finding exactly matching tokens is
difficult for some of the vowels and a few
consonants.
• The size of the corpus required is
very large in order to build an accurate
transliteration system.
IV. Conclusion
In this article, we discussed Tamil to Malayalam transliteration systems in general. Various applications of transliteration and the challenges associated with it were also pointed out. Transliteration is indispensable for a machine translation system that must handle named entities; in the case of dictionary-based translation systems, it is very useful, as it saves a lot of time and resources. But it has been observed that a large corpus is required to model the system accurately.
Memory-Based Language Processing And
Machine Translation
Nisha M
M.Tech CL
GEC, Sreekrishnapuram
Palakkad
Abstract: Memory-based language processing (MBLP) is an approach to language processing based on exemplar storage during learning and analogical reasoning during processing. From a cognitive perspective, the approach is attractive because it does not make any assumptions about the way abstractions are shaped, and does not make any a priori distinction between regular and exceptional exemplars, allowing it to explain the fluidity of linguistic categories, and irregularization as well as regularization in processing. Memory-based machine translation can be considered a form of Example-Based Machine Translation. The machine translation problem can be treated as a classification problem, and hence memory-based learning can be applied. This paper demonstrates a memory-based approach to machine translation.

I. Introduction
Memory-based language processing (MBLP) is based on the idea that learning and processing are two sides of the same coin. Learning is the storage of examples in memory, and processing is similarity-based reasoning with these stored examples.

MBLP finds its computational basis in the classic k-nearest neighbor classifier (Cover & Hart, 1967). With k = 1, the classifier searches for the single example in memory that is most similar to B, say A, and then copies its memorized mapping A' to B' (as visualized schematically in Figure 1). With k set to higher values, the k nearest neighbors to B are retrieved, and some voting procedure (such as majority voting) determines which value is copied to B'.
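For concreteness, a minimal sketch (not the authors' code) of this scheme over symbolic feature vectors, using the simple overlap distance common in memory-based learning and majority voting for k > 1:

```python
from collections import Counter

def overlap_distance(a, b):
    # Number of feature positions at which two examples disagree.
    return sum(1 for x, y in zip(a, b) if x != y)

class MemoryBasedClassifier:
    def __init__(self, k=1):
        self.k = k
        self.memory = []  # stored (features, label) examples

    def train(self, examples):
        # Learning is just storage of examples in memory.
        self.memory.extend(examples)

    def classify(self, features):
        # Retrieve the k most similar stored examples and vote.
        nearest = sorted(self.memory,
                         key=lambda ex: overlap_distance(ex[0], features))[:self.k]
        votes = Counter(label for _, label in nearest)
        return votes.most_common(1)[0][0]

clf = MemoryBasedClassifier(k=3)
clf.train([(("the", "cat", "sat"), "A"),
           (("the", "dog", "sat"), "A"),
           (("a", "cat", "runs"), "B")])
print(clf.classify(("the", "cat", "runs")))  # majority vote -> "A"
```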
Memory-based machine translation can be considered a form of Example-Based Machine Translation. In characterizing Example-Based Machine Translation, Somers (1999) refers to the
common use of a corpus or database of already translated examples, and the process of matching new input against instances in this database. This matching is followed by the extraction of fragments, which are then recombined to form the final translation.

The task of mapping one language to another can be treated as a classification problem. The method can be described as one in which the sentence to be translated is decomposed into smaller fragments, each of which is passed to a classifier for classification. The memory-based classifier is trained on the basis of a parallel corpus in which the sentence pairs have been reduced to smaller fragment pairs. The assigned classes thus correspond to the translations of the fragments. In the final step, these translated fragments are re-assembled to derive the final translation of the input sentence.
II. Background and Related Literature
Natural language processing models and systems typically employ abstract linguistic representations (syntactic, semantic, or pragmatic) as intermediate working units. Memory-based models enable us to ask whether we can do without them, since any invented intermediate structure is always implicitly encoded somehow in the words at the surface and the way they are ordered, and memory-based models may be capable of capturing the knowledge that is usually considered necessary, in an implicit way, so that it does not need to be explicitly computed.

Classes of natural language processing tasks in which this question can be investigated in the extreme are processes in which form is mapped to form, i.e., in which neither the input nor the output contains abstract elements to begin with, such as machine translation. Many current machine translation tools, such as the open source Moses toolkit (Koehn et al., 2007), indeed implement a direct mapping of source to target text, leaving all of syntax and semantics implicit; these hide in the form of statistical translation models between collocationally strong phrases, and of statistical language models of the target language. The MBLP approach to this problem involves using context on the source side, and using memory-based classification as a translation model (Van Gompel, Van den Bosch, & Berck, 2009).
An encouraging number of
recent studies attempt to link statistical
and memory-based models of language that
discover strong n-grams (for phrase-based
statistical machine translation or for
statistical language modeling) to the concept
of constructions, and to the question of the
extent to which human language users
exploit constructions. To mention two: Mos,
Van den Bosch, and Berck report that a
memory-based language model correlates
reasonably with the unit segmentations that
test subjects produce in a sentence-copy
task; the model implicitly captures several
strong complex lexical items
(constructions), although it fails to capture
long-distance dependencies, a common issue
with local n-gram-based statistical models.
In a related study, Arnon and Snider (2010)
show that subjects are sensitive to the
frequency (a rough approximation of
collocational strength) of four-word n-grams
such as 'don't have to worry', which are
processed faster when they are more
frequent. Their argument again centres on
whether strong subsequences need
hierarchical linguistic structure or can
simply be taken to be flat n-grams; it is
exactly this question that we aim to explore
further in our work with memory-based
language processing models.
III. Methodology
The process of translating a new
sentence is divided into a local phase
(corresponding to the first two steps in the
process) in which memory-based
translation of source trigrams to target
trigrams takes place, and a global phase
(corresponding to the third step) in
which a translation of a sentence is
assembled from the local predictions.
A. Local classification

Both in training and in actual
translation, a new sentence in the source
language is first converted into windowed
trigrams, where each token is taken as the
center of a trigram once. The first trigram of
the sentence contains an empty left element,
and the last trigram an empty right element.
At training time, each source-language
sentence is accompanied by a target-language
translation. Word alignment must be
performed before this step, so that the
classifier knows for each source word
whether it maps to a target word, and if so,
to which. Given the alignment, each source
trigram is mapped to a target trigram whose
middle word is the target word to which the
middle word of the source trigram aligns,
and whose neighboring words are
the center word's actual neighbors in the
target sentence.
Figure 2: An example training pair of
sentences, converted into six overlapping
trigrams with their aligned trigram
translations [1].
When translating new text, trigram outputs
are generated for all words in each new
source language sentence to be translated,
since our system does not have clues as to
which words would be aligned by statistical
word alignment.
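A minimal Python sketch of this windowing step (our illustration; the underscore boundary marker is an assumption) could look like this:

```python
def windowed_trigrams(tokens, empty="_"):
    """Convert a tokenized sentence into overlapping trigrams, taking
    each token as the centre of a trigram once. The first trigram gets
    an empty left element and the last an empty right element."""
    padded = [empty] + list(tokens) + [empty]
    return [tuple(padded[i - 1:i + 2]) for i in range(1, len(padded) - 1)]

print(windowed_trigrams("I will come to your party".split()))
# [('_', 'I', 'will'), ('I', 'will', 'come'), ..., ('your', 'party', '_')]
```

A six-word sentence yields six overlapping trigrams, matching the example of Figure 2.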
B. Global search

To convert the set of generated target
trigrams into a full sentence translation, the
overlap between the predicted trigrams is
exploited. Figure 3 illustrates a perfect case
of a resolution of the overlap (drawing on
the example of Figure 2), causing words in
the English sentence to change position with
respect to their aligned Dutch counterparts.
The first three English trigrams align
one-to-one with the first three Dutch words.
The fourth predicted English trigram,
however, overlaps to its left with the fifth
predicted trigram in one position, and
overlaps in two positions to the right with
the sixth predicted trigram, suggesting that
this part of the English sentence is
positioned at the end. Note that in this
example, the "fertility" words take and this,
which are not aligned in the training trigram
mappings (cf. Figure 2), play key roles in
establishing trigram overlap.
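The overlap resolution itself can be approximated greedily. The sketch below is a simplification we add for illustration, not the paper's algorithm; it handles only clean one- and two-position overlaps and reuses the `_` boundary marker from the previous sketch.

```python
def overlap(prefix, trigram):
    """Size of the largest overlap (2 or 1 positions) between the end
    of the partial sentence and the start of a predicted trigram,
    ignoring overlaps that involve the empty boundary marker."""
    for n in (2, 1):
        if tuple(prefix[-n:]) == trigram[:n] and "_" not in trigram[:n]:
            return n
    return 0

def stitch(trigrams):
    """Greedily assemble a sentence: start from the first predicted
    trigram and repeatedly attach the remaining trigram with the best
    overlap, appending only its non-overlapping tail."""
    remaining = list(trigrams)
    sentence = list(remaining.pop(0))
    while remaining:
        best = max(remaining, key=lambda t: overlap(sentence, t))
        n = overlap(sentence, best)
        sentence.extend(best[n:])
        remaining.remove(best)
    return [w for w in sentence if w != "_"]

trigrams = [("_", "the", "dog"), ("the", "dog", "barks"), ("dog", "barks", "_")]
print(stitch(trigrams))  # ['the', 'dog', 'barks']
```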
IV. Conclusion and Future Works

The study described in this paper has
demonstrated how memory-based learning
can be applied to machine translation.
Because memory-based learning stores all
examples in memory, it settles for a state of
maximal description length. This extreme
bias makes memory-based learning an
interesting comparative case against
so-called eager learners, such as
decision-tree induction algorithms and rule
learners. Phrase-based
memory-based machine translation has also
been implemented, using statistical toolkits
such as Moses for phrase extraction. Future
work includes developing memory-based
techniques for phrase extraction.
REFERENCES

[1] A. van den Bosch and P. Berck. Memory-based machine translation and language modeling. The Prague Bulletin of Mathematical Linguistics, 91:17–26, 2009.

[2] Antal van den Bosch and Walter Daelemans. Memory-Based Language Processing. Cambridge University Press, New York, 2005.

[3] Antal van den Bosch and Walter Daelemans. Implicit schemata and categories in memory-based language processing.
The Stanford Natural Language Processing Group
The Natural Language Processing Group at Stanford University is a team of faculty, research
scientists, postdocs, programmers and students who work together on algorithms that allow
computers to process and understand human languages. The work ranges from basic research in
computational linguistics to key applications in human language technology, and covers areas such
as sentence understanding, machine translation, probabilistic parsing and tagging, biomedical
information extraction, grammar induction, word sense disambiguation, and automatic question
answering. The Stanford NLP Group makes parts of their Natural Language Processing software
available to everyone: statistical NLP toolkits for various major computational linguistics
problems, which can be incorporated into applications with human language technology needs. A
distinguishing feature of the Stanford NLP Group is their effective combination of sophisticated,
deep linguistic modeling and data analysis with innovative probabilistic and machine learning
approaches to NLP. Their research has resulted in state-of-the-art technology for robust, broad-
coverage natural language processing in many languages. These technologies include their
competition-winning coreference resolution system; a state-of-the-art part-of-speech tagger; a
high-performance probabilistic parser; a competition-winning biological named entity recognition
system; and algorithms for processing Arabic, Chinese, and German text. All the software is written
in Java, and all recent distributions require Oracle Java 6+ or OpenJDK 7+. Distribution packages include
components for command-line invocation, jar files, a Java API, and source code. A number of
helpful people have extended this work with bindings or translations for other languages, so
much of the software can also easily be used from Python (or Jython), Ruby, Perl, JavaScript,
and F# or other .NET languages.
Link: http://nlp.stanford.edu/software/
Dialect Resolution

Manu V. Nair, Sarath K. S
Department of Computer Science and Engineering
Govt. Engg. College Sreekrishnapuram, Palakkad, India - 678 633

Abstract: A dialect is a regional or social variety of a language distinguished by pronunciation, grammar, or vocabulary, especially a way of speaking that differs from the standard variety of the language. It is a recognized and formal variant of the language spoken by a large group of one region, class or profession. Slang, by contrast, consists of a lexicon of non-standard words and phrases in a given language. Dialect resolution is an approach to mapping a dialect dialog or word into its formal form without losing its meaning. It is a localized approach through which a local person can express his ideas in his own style and have them rendered in a formal format. It can also resolve slang words.

I. Introduction
India is a special nation, with a
highly linguistically diverse population, 18
officially recognized languages, and many
other unofficial languages. Its diversity
exceeds that of China (7 major languages
and hundreds of dialects), even though India
covers only about one third of China's area.
A dialect is a form of a language
spoken in a particular geographical area or
by members of a particular social class or
occupational group, distinguished by its
vocabulary, grammar, and pronunciation.
The term is applied most often to regional
speech patterns, but a dialect may also be
defined by other factors, such as social class.
A dialect associated with a particular
social class can be termed a sociolect, one
associated with a particular ethnic group an
ethnolect, and a regional dialect a
regiolect or topolect. According to this
definition, any variety of a language
constitutes "a dialect", including any
standard varieties. A standard dialect is a
dialect that is supported by institutions.
Slang consists of words, expressions,
and meanings that are informal, and are used
by people who know each other very well or
who have the same interests. It mostly
includes expressions that are not considered
appropriate for formal occasions, and is
often vituperative or vulgar. Use of these words
and phrases is typically associated with the
subversion of a standard variety and is likely
to be interpreted by listeners as implying
particular attitudes on the part of the speaker.
In some contexts a speaker's selection of
slang words or phrases may convey prestige,
indicating group membership or
distinguishing group members from those
who are not a part of the group.
Among Indian languages, Malayalam
is a highly inflectional language. Political
and geographical isolation, the impact of
Christianity and Islam, and the arrival of the
Namboothiri Brahmins a little over a
thousand years ago all created conditions
favourable to the development of Malayalam
as a distinct local language. The
Namboothiris grafted a good deal of Sanskrit
onto the local dialect and influenced its
physiognomy. Malayalam itself contains
many dialect variations; each district of
Kerala has a dialect of its own.
II. Dialect Resolution
Dialect resolution is an approach to
mapping a dialect dialog or word into its
formal form without losing its meaning. It is
a localized approach through which a local
person can express his ideas in his own style
and have them rendered in a formal format.
Highly informal slang words are also to be
resolved.

From a computational point of view,
dialect resolution is a difficult task, since
there are different dialect variations even
within a single language. Computational
methods for dialect resolution include
rule-based, statistical, and machine learning
approaches. A hybrid approach, i.e. a
mixture of the above approaches, can also be
used; since it combines the strengths of the
individual approaches, it may achieve higher
accuracy.
III. Applications
A. Localization
Language localization is the process
of adapting a product that has been
previously translated into different
languages to a specific country or region.
Localization can be done for regions or
countries where people speak different
languages or where the same language is
spoken.

Dialect resolution has an important
role in localization. Since Malayalam, a
morphologically rich language, has many
types of dialects, dialect resolution has to be
incorporated into the localization process.
Tribal communities can then express their
ideas and knowledge to the outside world in
their own language; the dialect resolution
system will convert it into the formal
language to which it belongs. This will
encourage them to engage with the outside
world, especially with the government
processes and services aimed at them, and
they will feel free to use their mother tongue.
B. Machine Translation
Machine Translation is the sub-field
of computational linguistics that
investigates the use of software to
translate text or speech from one natural
language to another.
Dialect resolution is very useful
when a story written in a colloquial
variety is translated to another language.
First, the story is passed through the
dialect resolution process to put it into
the standard language; then it is given to
the machine translation system. Without
a dialect resolution system it is difficult
to deal with such colloquial works. A
similar process can be applied to old
transcripts, which may contain details of
rare medicines, histories, and so on; this
information can be translated to another
language only with the support of a
dialect resolving system.
C. Speech to text applications
Dialect resolution is also crucial
for speech-to-text applications. Even
when speech is transcribed with good
accuracy, the resulting text may contain
dialect words, which must be resolved
before the text can be used as formal
language. This is especially useful when
text in a local variety has to be converted
into the formal language and, through
that, into other languages, for example in
parliaments and legislative assemblies.
IV. Issues in Dialect Resolution
As already noted, dialect resolution is not
an easy task; it raises several issues at
implementation time.
A. Input Level
The Thrissur dialect ranges from
simple utterances to complex ones
with compound and slang words.
Inputs such as sentences containing
metaphorical expressions, named
entities, large compound words, or
redundant words are harder to
resolve together, and each of them
should be handled separately.
Malayalam is a highly agglutinative
language with free word order. Many
slang words change at the tail, and the
remaining part, together with its context
words, provides a clue to the actual word.
B. Machine Learning Level
The dialect patterns are
learned through a learning process.
The smaller the training corpus, the
lower the accuracy on unknown
items. The choice of machine
learning method is also a crucial
factor, since different learners
perform differently on the same
corpus. Obtaining a feature-annotated
corpus is also a big task, and named
entities in the training data can
mislead the learner; they have to be
transliterated.
C. Corpus Level
The availability of a very large
corpus of formal Malayalam
sentences is a big issue, as is the lack
of a dictionary covering all slang words.
D. Output Level
Keeping the output natural is a
great challenge. Handling ambiguous
words requires context information.
And even though Malayalam is a
language with highly free word
order, generating semantically
correct sentences requires proper
word ordering.
V. Conclusion & Future Works

Dialect resolution is a significant step
towards localized language processing for
Malayalam. The ultimate goal is to bring all
dialectal ideas and information into a formal
format that is readable and understandable to
others, removing dialect boundaries within
the language; translation into other
languages then becomes much easier.

Dialect resolution for the Thrissur
dialect can be adapted to other dialects and
other languages, given sufficient corpus
support. As an extension of this approach,
gender, tense, person and number
information can be labeled for
disambiguation. Results will improve with a
large corpus of formal sentences and a
dictionary covering all slang words; machine
learning techniques such as TnT, SVM or
CRF trained on such a corpus will provide
better results.
VI. REFERENCES
[1] Daniel Jurafsky and James H. Martin. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition. Prentice Hall, 1999.

[2] Thorsten Brants. TnT – A Statistical Part-of-Speech Tagger. Saarland University, Computational Linguistics, 2000.
M.Tech Computational Linguistics
Department of Computer Science and Engineering
2012-2014 Batch
Details of Master Research Projects
Title Spoken Language Identification
Name of
Student Abitha Anto
Abstract Spoken Language Identification (LID) refers to the automatic process through
which the identity of the language spoken in a speech sample is determined. This
project is based on the phonotactic approach. The phone/phoneme sets differ
from one language to another, as do the frequencies of occurrence of certain
phones/phonemes. Based on the phone sequences of each language, a language-
dependent n-gram probability distribution model is estimated. Language
identification is done by comparing the frequency of occurrence of certain
phones or phone sequences with that of the target languages. The applications of
LID systems fall into two categories: preprocessing for machine understanding
systems and preprocessing for human understanding systems. This project tries
to identify Indian languages such as English (Indian), Hindi, Malayalam, Tamil
and Kannada.
Tools HTK, SRILM, Matlab
Place of
Work
Amrita Vishwa Vidyapeetham, Coimbatore.
Title Statistical Approach to Anaphora Resolution for Improved Summary Generation
Name of
Student Ancy Antony
Abstract Anaphora resolution deals with finding the noun phrase antecedents of
anaphors. It is used in natural language processing applications such as text
summarization, information extraction, question answering, machine translation,
and topic identification. The project aims to resolve the most frequently occurring
pronouns in documents, such as third person pronouns. A statistical approach is
utilized: important features are extracted and used for finding the antecedents of
anaphors. The proposed system includes components such as a pronoun extractor,
noun phrase extractor, gender classifier, subject identifier, named entity
recognizer, chunker and part-of-speech tagger.
Tools NLTK, TnT, Stanford Parser
Place of
Work
GEC Sreekrishnapuram
Title Anaphora Resolution in Malayalam
Name of
Student Athira S
Abstract An anaphor is a linguistic entity which indicates a referential tie to some other
linguistic entity in the same text. Anaphora resolution is the process of
automatically finding the pairs of pronouns or noun phrases in a text that refer to
the same incidence, thing, person, etc., called the referent. The first member of the
pair is called the antecedent and the second the anaphor. This project tries to
resolve anaphora in Malayalam. We outline an algorithm for anaphora resolution
that works from the output of a Subject-Verb-Object tagger; Person-Number-
Gender agreement is also included in the existing tagging system. Anaphora
resolution is done based on the tagging and the degree of salience (salience value).
An anaphora resolution system can itself improve the performance of many NLP
applications such as text summarisation, term extraction and text categorisation.
Tools TnT
Place of
Work
IIITMK, Thiruvananthapuram, Kerala
Title Extractive News Summarization using Fuzzy Graph based Document Model
Name of
Student Deepa C A
Abstract This project describes a news summarization system based on fuzzy graph
document models. Modelling documents as fuzzy graphs is used to summarize a
set of similar newspaper articles. Each article is represented as a fuzzy graph
whose nodes represent sentences; an edge connects two nodes if there exists a
similarity between those sentences. The proposed extractive document
summarizer uses a fuzzy similarity measure to weight the edges.
Tools WebScrapy, NLTK
Place of
Work
Government Engineering College, Sreekrishnapuram
Title Text Summarization using Machine Learning Approach
Name of
Student Sreeja M
Abstract This project aims at comparing summarization algorithms on the DUC 2001
dataset using the ROUGE toolkit, developing a new summarization algorithm,
and comparing it with previous work.
Tools Eclipse, Rouge, R-Studio
Place of
Work
Centre for Artificial Intelligence and Robotics , DRDO, Bangalore
Title Fuzzy model-based emotion and sentiment analysis of text documents
Name of
Student Divya M
Abstract Computer systems are inevitable in almost all aspects of everyday life. With the
growth of artificial intelligence, the capabilities and functionalities of computer
systems have been enhanced. Emotions constitute a key factor in human
communication. Human emotion can be expressed through various mediums such
as speech, facial expressions, gestures and textual data. Emotion and sentiment
analysis of text is a growing area of interest in the field of computational
linguistics. Emotion detection approaches use or modify concepts and general
algorithms created for subjectivity and sentiment analysis. In this project the
emotion and sentiment of text is analysed by means of a fuzzy approach. The
proposed method involves the construction of a knowledge base of words known
to have emotional content, representing six basic emotions: anger, fear, joy,
sadness, surprise and disgust. The system takes natural language sentences as
input, analyses them and determines the underlying emotion. It also represents
multiple emotions contained in a text document. Experimental results indicate
quite satisfactory performance.
Tools NLTK, Stanford Parser
Place of
Work
Government Engineering College, Sreekrishnapuram
Title Ontology based information retrieval system in legal documents
Name of
Student Gopalakrishnan. G
Abstract An ontology serves as a knowledge base for a domain, used by agents to
mine relationships and dependencies and/or to answer user queries. The domain
of focus should possess a valid structure and hierarchy. The Indian Penal Code (IPC) is
one such realm: the apex criminal code of India, codified into five
hundred and eleven sections under twenty-three chapters. An ontology for the IPC
creates a vista that enables legal professionals as well as the common man to access the
intended sections of the code in the easiest way. The ontology also provides the
judgments produced over each section, which makes the IPC more transparent and
closer to the people. Protege and OWL are used to develop the ontology. Once
completed, the ontology serves as an integral reference point for the legal
community of the country, and can be applied to information retrieval, decision
support/making, agent technology and question answering.
Tools Protege, Apache Jena, Python regular expression
Place of
Work
Government Engineering College, Sreekrishnapuram
Title Topic Detection using LDA algorithm
Name of
Student Indu M
Abstract Probabilistic topic models are a suite of algorithms whose aim is to discover the
hidden thematic structure in large archives of documents. Topic modelling can
enhance information network construction by grouping similar objects, event
types and roles together. Each document is considered a distribution over a small
number of topics, where each topic is a distribution over words. The main aim is
to find the most probable topics and the corresponding word distributions over
those topics. The main algorithm used here is LDA (latent Dirichlet allocation), a
generative model that allows sets of observations to be explained by unobserved
groups that explain why some parts of the data are similar.
Tools Python NLTK, Gensim
Place of
Work
Centre for Artificial Intelligence and Robotics , DRDO, Bangalore
Title Discourse Segmentation using Anaphora Resolution in Malayalam
Name of
Student Lekshmi T S
Abstract A number of linguistic devices are employed in text-based discourse for the
purposes of introducing, defining, refining, and reintroducing discourse entities.
This project looks at one of the most pervasive of these mechanisms, anaphora,
and addresses the question of how discourse is maintained and segmented properly
by resolving anaphora in Malayalam. An anaphor is a linguistic entity which
indicates a referential tie to some other linguistic entity in the same text. The
behaviour of referring expressions throughout the discourse seems to be closely
correlated with segment boundaries. In general, within the limits of a segment,
referring expressions adopt a reduced form such as pronouns, whereas a reference
across a discourse boundary tends to be realized via unreduced forms like
definite descriptions and proper names. We outline an algorithm for anaphora
resolution that works from the output of a Subject-Verb-Object tagger. Next,
by positioning the anaphoric references, segment boundaries are identified. The
focus of this project is on anaphora resolution as an essential prerequisite for
building the discourse segments of a text. An anaphora resolution system can
itself improve the performance of many NLP applications such as text
summarisation, term extraction and text categorisation.
Tools TnT
Place of
Work
IIITM-K, Trivandrum
Title Speaker Verification using i-vectors
Name of
Student
Neethu Johnson
Abstract Speaker verification is the process of verifying the claimed identity of a speaker
based on the speech signal from the speaker. An i-vector is a compact
representation of a speaker utterance. In this modeling, a new low-dimensional
speaker- and channel-dependent space is defined using a simple factor analysis.
This space is named the total variability space because it models both speaker and
channel variabilities. Each speaker utterance is represented by an i-vector in the
total variability space. Channel compensation is carried out in the total variability
space rather than in the GMM supervector space, followed by a scoring technique.
The i-vectors can thus be seen as new speaker recognition features, where the
factor analysis plays the role of feature extractor rather than modeling speaker and
channel effects. The use of the cosine kernel as a decision score for speaker
verification makes the process faster and less complex than other scoring methods.
Tools HTK, LIA-RAL, ALIZE, Matlab
Place of
Work
Amrita Vishwa Vidyapeetham, Coimbatore.
Title Topic Identification Using Fuzzy Graph Based Document Model
Name of
Student
Reshma O.K.
Abstract Fuzzy graphs can be used to model paragraph-paragraph correlation in
documents: nodes of the graph represent paragraphs and the interconnections
represent correlations. This project aims to find the topic from the fuzzy graph
model of a document. Topic identification refers to automatically finding the
topics that are relevant to a document. The task goes beyond keyword extraction,
since relevant topics may not be mentioned explicitly in the document or its title,
and instead have to be obtained from the context or the concept underlying the
document. The proposed system models the document as a fuzzy graph; then,
using eigen analysis of the correlation values from the graph, the most important
paragraph is found. The terms in that paragraph and their synonyms are then
mapped to predefined concepts, and thus the topic is extracted from the document
content. Topic identification can assist search engines in the retrieval of
documents by enhancing the relevancy measures.
Tools NLTK, Stanford CoreNLP, Gensim, Stanford Topic Modelling Toolbox
Place of
Work
Government Engineering College, Sreekrishnapuram
Title Ontology generation from Unstructured Text using Machine Learning
Methods
Name of
Student
Nibeesh K
Abstract This project presents a system for ontology instance extraction that is automatically
trained to recognize ontology instances using statistical evidence from a training
set. This approach has several advantages. First, it eliminates the need for expert
language-specific linguistic knowledge: to train the automated system, users only
need to tag ontology instances, and if the users' needs change, the system can
relearn from new data quickly. Second, system performance can be improved by
increasing the amount of training data without requiring extra expert knowledge.
Third, if new knowledge sources become available, they can easily be integrated
into the system as additional evidence.
Tools GATE, Protege, Jena, OWL2,JAVA
Place of
Work
Government Engineering College, Sreekrishnapuram
Title Text Classification Using String Kernels
Name of
Student
Varsha K V
Abstract Text classification is the task of assigning predefined categories to free text
documents. The documents are represented as feature vectors of high dimension,
where the feature values can be n-grams, named entities, words, etc. Kernel
methods (KMs) make use of kernel functions which give the inner product of the
document feature vectors; they compute the similarity between text documents
without explicitly extracting the features, and are thus considered an effective
alternative to classification based on explicit feature extraction. The project makes
use of string kernels for text classification. A string kernel computes the similarity
between two documents from the substrings they contain; the n-gram kernel and
gappy n-gram kernels are used for the classification. In KMs, learning takes place
in the feature space, where learning algorithms can be applied. The project uses
the Support Vector Machine algorithm; SVMs are a class of algorithms that
combine the principles of statistical learning theory with optimisation techniques
and the idea of a kernel mapping. The non-dependence of KMs on the
dimensionality of the feature space and the flexibility of using any kernel function
make them a good choice for text classification.
Tools Openkernel, OpenFST, Libsvm.
Place of
Work
Amrita Vishwa Vidyapeetham, Ettimadai, Coimbatore.
Title Domain Specific Information Extraction: A Comparison using different Machine
Learning Methods
Name of
Student Prajitha U
Abstract Information extraction is the task of extracting structured information from
unstructured text. It has been widely used in various research areas including NLP
and IR. In the proposed system, different machine learning methods (TnT, SVM,
CRF and TiMBL) are compared on the task of extracting perpetrator entities from
the open-source MUC-3/4 corpus, which contains newswire articles on Latin
American terrorism.
Tools TnT, SVM, CRF, TiMBL
Place of
Work
Government Engineering College, Sreekrishnapuram
Title Eigen Analysis Based Automatic Document summarization
Name of
Student
Sruthimol M P
Abstract Automatic document summarization is the process of reducing a text document
to a summary that retains the main points of the original document. Ranking
sentences according to their importance for the summary is the main task in
summarization. This project proposes an effective approach to document
summarization by sentence ranking, done by vector space modeling and eigen
analysis. The vector space model represents sentences as vectors in an
n-dimensional space with tf-idf weighting. The principal eigenvectors of the
characteristic equation rank the sentences according to their relevance.
Experimental results on the standard DUC 2001 test collection with the ROUGE
evaluation system show that the proposed eigen-analysis-based sentence ranking
improves on conventional tf-idf language model based schemes.
Tools NLTK, Stanford CoreNLP, ROUGE Toolkit
Place of
Work
GEC Sreekrishnapuram
Title Speech Nonspeech Detection
Name of
Student Sincy V. Thambi
Abstract Speech/nonspeech detection identifies pure speech and ignores nonspeech,
which includes music, noise, various environmental sounds, silence, etc. It is
performed by analysing features that distinguish the two efficiently. Various
time domain, frequency domain and cepstral domain features are analysed over
short time frames of 20 ms, together with their mean and standard deviation
over segments of 200 ms. The best features are then selected using various
feature dimensionality reduction mechanisms. An accuracy of 95.085% is
obtained on a 2-hour speech/nonspeech database using a decision tree approach.
Tools Matlab, Weka
Place of
Work
Amrita Vishwa Vidyapeetham, Coimbatore
Title Automatic Information Extraction and Visualization from Defence-related
Knowledge Bases for Effective Entity Linking
Name of
Student Sreejith C
Abstract The project aims to develop an intelligent information extraction system capable
of extracting entities and relationships from natural language reports using
statistical and rule-based approaches. In this project, the knowledge about the
domain is imparted to the machine in the form of an ontology. The extracted
entities are stored in a knowledge base and further visualized as a graph based on
the relationships existing between them, for effective entity linking and inference.
Tools Java, Jena, Protege, CRF++
Place of
Work
Centre for Artificial Intelligence and Robotics , DRDO, Bangalore
Article Invitation for CLEAR - September 2014

We are inviting thought-provoking articles, interesting dialogues and healthy debates on multifaceted
aspects of Computational Linguistics for the forthcoming issue of CLEAR (Computational Linguistics in
Engineering And Research) magazine, to be published in September 2014. The suggested areas of discussion
are:

The articles may be sent to the Editor on or before 10th September, 2014 through the email address
given below.
For more details visit: www.simplegroups.in
Editor, CLEAR Magazine
Representative, SIMPLE Groups
M.Tech Computational Linguistics
Dept. of Computer Science and Engg.
Govt. Engg. College, Sreekrishnapuram, Palakkad
www.simplegroups.in
[email protected]

SIMPLE Groups: Students Innovations in Morphology, Phonology and Language Engineering
Hi,
Computational Linguistics is an emerging and promising discipline
shaping future research and development activities in academia and industry,
in fields ranging from Natural Language Processing and Machine Learning to
Algorithms and Data Mining. Whilst considerable progress has been made in
the development of Artificial Intelligence, Human-Computer Interaction and
related fields, the construction of software that can understand and process
human language still remains a challenging area. New challenges arise in the
modelling of such complex systems, sophisticated algorithms, advanced
scientific and engineering computing and associated problem-solving
environments.

CLEAR is designed to inform readers about the state of the art in a
number of specialized fields related to Computational Linguistics,
highlighting computational methods and techniques for science and
engineering applications. CLEAR is a platform for all, from academic
researchers and professional communities to industry professionals, across a
range of topics in Natural Language Processing and Computational
Linguistics. CLEAR welcomes thought-provoking articles, interesting
dialogues and healthy debates on multifaceted aspects of all areas covered,
including audio, speech, and language processing and the sciences and
technologies that support them.
Thank you for your time and consideration.
Sreejith C
Students' Innovations in Morphology, Phonology and Language
Engineering (abbreviated as SIMPLE) is the official website of M.Tech
Computational Linguistics, Govt. Engineering College, Palakkad. As the name
indicates, SIMPLE is a platform for showcasing our innovations, ideas and
activities in the field of Computational Linguistics. Applications of AI
become much more effective when systems incorporate natural language
understanding capabilities. Here, we are trying to explore how human
language understanding capabilities can be used to model intelligent
behavior in computers. We hope our pursuit of excellence will ultimately benefit
the society, as we believe "Innovation brings changes to the society."

SIMPLE has plans and proposals for active participation in research as
well as in the applications of Computational Linguistics. The association is
interested in organizing seminars, workshops and conferences in this area and
is also looking forward to actively networking with people and organizations in
this field.

Our activities are led by the common philosophy of innovation, sharing and
serving the society. We plan to bring out the association magazine that
explains current technology from a CL perspective.
www.simplegroups.in