Development of analysis rules to identify proper noun from bengali sentence for universal networking...

Development of Analysis Rules to Identify Proper Noun from Bengali Sentence for Universal Networking Language
Prepared by: Md. Syeful Islam, Sr. Software Engineer at Samsung Research Institute Bangladesh
Google Scholar: http://scholar.google.com/citations?user=b0QDD2AAAAAJ&hl=en
academia.edu: https://independent.academia.edu/SyefulIslam

description

- Today regional economies, societies, cultures and education are integrated through a globe-spanning network of communication and trade.
- This globalization trend calls for a homogeneous platform on which each member can understand what the others say and carry on the discussion smoothly.
- However, language barriers around the world continue to prevent people from coming together in a single domain for sharing knowledge and information.
- Researchers therefore work on various languages and try to provide a platform where multilingual people can communicate in their native languages.
- They analyze language structure and form the structural grammar and rules used to translate one language into another.
- Over the last few years, several language-specific translation systems have been proposed.
- Since these systems are based on specific source and target languages, they have their own limitations.
- As a consequence, the United Nations University/Institute of Advanced Studies (UNU/IAS) decided to develop an inter-language translation program.
- Their continued research led to a common form of language known as the Universal Networking Language (UNL) and to the UNL system.
- The UNL system is an initiative to overcome the problem of language pairs in automated translation. UNL is an artificial language based on the interlingua approach. It acts as an intermediate, computer-oriented semantic language through which text written in one language can be converted into text in any other language.
- The UNL system consists of three major components:
  - language resources,
  - software for processing language resources (the parser), and
  - supporting tools for maintaining and operating the language processing software or developing language resources.
- The parser of the UNL system takes an input sentence, parses it according to rules, and converts each word into the corresponding universal word from the word dictionary.
- The challenge in detecting named entities is that such expressions are hard to analyze using UNL because they belong to an open class of expressions: there is an infinite variety, and new expressions are constantly being invented.
- Bengali is the seventh most popular language in the world, the second in India, and the national language of Bangladesh.
- Proper nouns are therefore an important problem: sentences require UNL dictionary lookups for proper nouns, yet all proper nouns (names) cannot be exhaustively maintained in the dictionary for automatic identification. This research project addresses that task: proper noun detection and conversion.
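The dictionary problem described above can be illustrated with a minimal sketch, assuming a toy in-memory dictionary. The `Entry` class, the `lookup` function and the romanized headwords `por` and `karim` are hypothetical stand-ins for illustration (a real UNL dictionary uses Bengali headwords and the EnConverter's own entry format); only the universal word for "read" and the TEMP flag are taken from the presentation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    headword: str        # source-language word (romanized here for illustration)
    universal_word: str  # UNL universal word, empty for temporary entries
    flags: tuple         # grammatical attributes attached to the entry


# Toy word dictionary with a single verb root (assumed contents).
WORD_DICTIONARY = {
    "por": Entry(
        "por",
        "read(icl>see>do,agt>person,obj>information)",
        ("ROOT", "CEND", "^ALT"),
    ),
}


def lookup(word, dictionary=WORD_DICTIONARY):
    """Return the dictionary entry, or a temporary entry flagged TEMP."""
    entry = dictionary.get(word)
    if entry is not None:
        return entry
    # Unknown word: keep it as a temporary entry so that later rules
    # can decide whether it is a proper noun.
    return Entry(word, "", ("TEMP",))


print(lookup("por").flags)
print(lookup("karim").flags)
```

A name such as `karim` never needs its own dictionary entry in this scheme; it simply comes back flagged TEMP and is left for the detection rules to classify.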

Transcript of Development of analysis rules to identify proper noun from bengali sentence for universal networking...

  • 1. Development of Analysis Rules to Identify Proper Noun from Bengali Sentence for Universal Networking Language. Prepared by: Md. Syeful Islam, Sr. Software Engineer at Samsung Research Institute Bangladesh. Google Scholar: http://scholar.google.com/citations?user=b0QDD2AAAAAJ&hl=en academia.edu: https://independent.academia.edu/SyefulIslam

2. Presentation covered
- Overview
- Introducing the Universal Networking Language (UNL)
- Proposed Proper Noun Conversion Approach
- Result Analysis

3. Overview
Today regional economies, societies, cultures and education are integrated through a globe-spanning network of communication and trade. This globalization trend calls for a homogeneous platform on which each member can understand what the others say and carry on the discussion smoothly. However, language barriers around the world continue to prevent people from coming together in a single domain for sharing knowledge and information. Researchers therefore work on various languages and try to provide a platform where multilingual people can communicate in their native languages. They analyze language structure and form the structural grammar and rules used to translate one language into another. Over the last few years, several language-specific translation systems have been proposed. Since these systems are based on specific source and target languages, they have their own limitations. As a consequence, the United Nations University/Institute of Advanced Studies (UNU/IAS) decided to develop an inter-language translation program. Their continued research led to a common form of language known as the Universal Networking Language (UNL) and to the UNL system.

4. Overview
The UNL system is an initiative to overcome the problem of language pairs in automated translation. UNL is an artificial language based on the interlingua approach. It acts as an intermediate, computer-oriented semantic language through which text written in one language can be converted into text in any other language. The UNL system consists of three major components:
- language resources,
- software for processing language resources (the parser), and
- supporting tools for maintaining and operating the language processing software or developing language resources.
The parser of the UNL system takes an input sentence, parses it according to rules, and converts each word into the corresponding universal word from the word dictionary. The challenge in detecting named entities is that such expressions are hard to analyze using UNL because they belong to an open class of expressions: there is an infinite variety, and new expressions are constantly being invented. Bengali is the seventh most popular language in the world, the second in India, and the national language of Bangladesh. Proper nouns are therefore an important problem: sentences require UNL dictionary lookups for proper nouns, yet all proper nouns (names) cannot be exhaustively maintained in the dictionary for automatic identification. This research project addresses that task: proper noun detection and conversion.

5. Introducing the Universal Networking Language (UNL)
What is the UNL?
The UNL consists of Universal Words (UWs), Relations, Attributes, and the UNL Knowledge Base. The Universal Words constitute the vocabulary of the UNL, Relations and Attributes constitute its syntax, and the UNL Knowledge Base constitutes its semantics.
Why is the UNL necessary?
In the future, computers will need the capability to perform knowledge processing. Knowledge processing means that a computer takes over human thought and judgment by using human knowledge. This requires processing based on content. Computers need to have knowledge in order to process knowledge, so they need a language in which to hold knowledge as humans do, and a language in which to process content as humans do. The UNL is a language for computers to do so. The UNL can express knowledge like a natural language, and it can express content like a natural language.

6. Introducing the Universal Networking Language (UNL)
What is different from others?
Systems that can deal with knowledge and content have already been developed, but their representations of knowledge or content differ from each other.
Moreover, their representations are language dependent; namely, the concept primitives used to represent knowledge are language dependent. The knowledge or content of one system cannot be used in other systems. The situation is the same in machine translation. For example, even if we put together all the results of research and development in machine translation, we cannot realize multilingual machine translation systems that can break language barriers.
Advantage of a common language for computers
The UNL greatly reduces the cost of developing the knowledge or content necessary for knowledge processing, by sharing knowledge and content. Furthermore, if all the knowledge that software needs in order to do something is described in a language for computers such as the UNL, the software only needs to interpret instructions written in that language to perform its functions. Those instructions could be shared by other software. We can then accumulate such knowledge for computers, like a library for humans.

7. Introducing the Universal Networking Language (UNL)
How does the UNL express information?
The UNL expresses information or knowledge in the form of a semantic network with hyper-nodes. A UNL semantic network is made up of a set of binary relations; each binary relation is composed of a relation and the two UWs that hold the relation.
Expression of Binary Relation
A binary relation of the UNL is expressed in the following format:
<relation>(<uw1>, <uw2>)
In <relation>, one of the relations defined in the UNL Specifications is described. In <uw1> and <uw2>, the two UWs that hold the relation given at <relation> are described. Such a binary relation is interpreted as follows.
Interpretation of Binary Relation
The semantic network [4] of a UNL expression is a directed graph formed by the binary relations.
The three elements of each binary relation have the following interrelationship:
<uw1> --- <relation> --> <uw2>
This interrelationship means that the UW given in <uw2> plays the role indicated by the relation given in <relation> and held by the UW given in <uw1>, whereas the UW given in <uw1> holds the relation given in <relation> with the UW given in <uw2>. As mentioned above, all binary relations that compose a UNL expression have directions, and the semantic network of a UNL expression is a directed graph.

8. Introducing the Universal Networking Language (UNL)
Hyper-graph
The UNL expression is a hyper semantic network. That is, each node of the graph, <uw1> and <uw2> of a binary relation, can be replaced with a semantic network. Such a node consists of a semantic network of a UNL expression and is called a scope. A scope can be connected with other UWs or scopes. The UNL expressions in a scope are distinguished from others by assigning an ID to the <relation> of the set of binary relations that belong to the scope. The general description format of binary relations for a hyper-node of a UNL expression is the following:
<relation>:<scope-id>(<node1>, <node2>)
where <scope-id> is the ID for distinguishing a scope. <scope-id> need not be specified when a binary relation does not belong to any scope. <node1> and <node2> can each be a UW or a <scope node>. A <scope node> is given in the format :<scope-id>.
The UNL expresses information by classifying it into objectivity and subjectivity. Objectivity is expressed using UWs and relations. Subjectivity is expressed using attributes attached to UWs.

9. Introducing the Universal Networking Language (UNL)
Figure: Sentence information is represented as a hyper-graph

10.
Introducing the Universal Networking Language (UNL)
UNL Expression
[S:2]
{org:es}
Hace tiempo, en la ciudad de Babilonia, la gente comenzo a construir una torre enorme, que parecia alcanzar los cielos.
{/org}
{unl}
tim(begin(icl>do).@entry.@past, long ago(icl>ago))
mod(city(icl>region).@def, Babylon(icl>city))
plc(begin(icl>do).@entry.@past, city(icl>region).@def)
agt(begin(icl>do).@entry.@past, people(icl>person).@def)
obj(begin(icl>do).@entry.@past, build(icl>do))
agt(build(icl>do), people.@def)
obj(build(icl>do), tower(icl>building).@indef)
aoj(huge(icl>big), tower(icl>building).@indef)
aoj(seem(icl>be).@past, tower(icl>building).@indef)
obj(seem(icl>be).@past, reach(icl>come).@begin-soon)
obj(reach(icl>come).@begin-soon, tower(icl>building).@indef)
gol(reach(icl>come).@begin-soon, heaven(icl>region).@def.@pl)
{/unl}
[/S]

11. Proposed Proper Noun Conversion Approach
Proper noun conversion is relatively simple in English, because such entities start with a capital letter. Bangla has no concept of small or capital letters, so it is difficult to tell whether a word is a proper noun or not. In this paper we define our work in two parts:
- identifying a proper noun in a Bengali sentence, and
- converting this proper noun into UNL form.

12. Proposed Proper Noun Conversion Approach
In our research project we propose a method of proper noun conversion for UNL that combines
- dictionary-based and
- rule-based approaches.
In this approach, the UNL EnConverter identifies the proper noun using rules and morphological analysis. The Post Converter converts it into the corresponding language (English). The approaches are described in sequence on the following slides.

13. Flow chart of proper noun ( ) conversion approach

14. Proper Noun detection approach
The UNL system first takes an input sentence and parses it word by word. It searches the dictionary for the corresponding word. If the word is not found, it tries to recognize whether the temporary word is a proper noun based on defined rules.
If this process fails, morphological analysis is used.

15. Proper Noun detection approach
The approaches to proper noun detection are described in sequence as three steps:
- Dictionary-based analysis for proper noun detection
- Rule-based analysis for proper noun detection
- Morphological analysis

16. Dictionary based analysis for proper noun detection
A dictionary is maintained in which we try to include the most commonly used proper nouns; it is known as the Name dictionary. Here we describe the dictionary-based proper noun detection technique step by step.
First, the input sentence is processed by the EnConverter, which looks up each word in the word dictionary. If the word is found, it is converted into the corresponding universal word. If it is not in the dictionary, the EnConverter creates a temporary entry for the word.
Second, the EnConverter looks up the word flagged TEMP in the name dictionary. If it is found, the word is considered a possible noun, and rules are applied to confirm this.
Finally, if the word is not in the name dictionary, it is sent to the morphological analyzer to confirm that the word is a proper noun.

17. Rule-based analysis for proper noun detection
Rule-based approaches rely on a set of rules, one or more of which must be satisfied by the test word. Some words are used in Bangla sentences as part of a name. Here we use a technique that identifies proper nouns using such words (parts of names). First, we make dictionary entries with the pof (part of name) attribute and other relevant attributes. Some typical rules that use pof words to identify proper nouns in a Bangla sentence are given on the next slide.

18. Rule-based analysis for proper noun detection
Rule 1. If the parser finds a word like , , , , , , , ,, , , , , , , , , , , , etc., then the word is considered the first word of a name and its status is set to first part of name (FPN). The word or sequence of words with status TEMP that follows the FPN is also considered part of the name.
Rule 2.
If the parser finds a word (title words and mid-name words of human names) like , , , , , , , , , , , , , , ,, , , , etc. and , , , , , , etc. after a temporary entry word, then the LPN and the temporary entry word together with such words are likely to constitute a multi-word name (proper noun). For example, , are all names.
Rule 3. If there are two or more words in a sequence that represent, or are spelled like, the characters of Bangla or English, then they belong to the name. For example, (BA), (MA), (MBBS) are all names. Note that this rule does not distinguish between a proper name and a common name.

19. Rule-based analysis for proper noun detection
Rule 4. If a substring like , , , , , , , , , occurs at the end of the temporary word, then the temporary word is likely to be a name.
Rule 5. If a word in a temporary entry ends with -, , , , , , , , then the word is likely to be a name.
Rule 6. If a word like - , , , , , , , , , , , , ,, is found after the temporary word, then the NW along with such a word may belong to a name. For example - , , are all names.
Rule 7. If the sentence contains , , , , , , ,, after the temporary word, then the temporary word is likely to be a proper noun.
Rule 8. If the word ends with suffixes like ,, , , , , , , ,, etc., then the word is usually not a proper noun.

20. Morphological Analysis for proper noun detection
When the previous two steps fail to identify a proper noun, or it is unclear whether the word is a proper noun, we apply morphological analysis to make sure that the unknown word is a proper noun. In this case we consider the structure of the word and its position in the sentence, and identify the word type.
Rule 1. A proper noun always ends with the 1st, 2nd, 5th or 6th verbal inflexion (bibhakti). So when the parser finds an unknown word with the 1st, 2nd, 5th or 6th bibhakti, the word may be a proper noun.
Rule 2.
If, from the sentence structure, the parser finds that an unknown word is in the position of karti kaarak and ends with the 1st bibhakti, it is considered a proper noun.
Rule 3. If the unknown word is in the position of karma kaarak, is an indirect object, and ends with the 2nd bibhakti, it is considered a proper noun. A direct object, however, is not a proper noun.

21. Morphological Analysis for proper noun detection
Rule 4. If a sentence contains more than one word ending with the 1st bibhakti and the first of these words is flagged as unknown, then that word must be the karti kaarak and is a proper noun.
Rule 5. In the case of apaadaan kaarak, if a word in the sentence ends with the 5th bibhakti and is unknown, it must be a noun. Since most other nouns are in the word dictionary, an unknown word ending with the 5th bibhakti must be a proper noun.
Rule 6. In the case of adhikaran kaarak, if a word in the sentence ends with the 7th bibhakti and is unknown, it may be the name of a place. If the word is not in the dictionary, it is considered a proper noun (the name of a place).

22. Morphological Analysis for proper noun detection
When the EnConverter identifies a word or sequence of words as a proper noun and converts the sentence into a UNL expression, it keeps track of the temporary word using the custom UNL relations tpl and tpr. We use the two relations tpr and tpl to identify the word that should be converted. The relation tpr is used when the EnConverter finds the temporary word after a word with the pof (part of name) attribute, and tpl is used when it finds the temporary word before one. When the EnConverter finds a tpr relation, it converts the word after the blank space. For tpl, it converts the first word. If the proper noun contains only one word, with attribute TEMP, the converter uses this TEMP attribute to convert it.

23.
Function of Post Converter
In the previous slides we identified the proper noun in a Bangla sentence and converted the sentence into the corresponding intermediate UNL expression. There is a small problem, however: the EnConverter converts only those words that are in the word dictionary. A temporary word that has already been identified as a proper noun, or as part of one, is not converted and remains as it is in the Bengali sentence. This makes it difficult for people who cannot read Bangla. So it must be converted into English for the UNL expression. The Post Converter does this conversion.

24. Proposed enconversion process
Figure: Architecture of Post converter (Bengali to English)

25. Proposed enconversion process (Algorithm)
How the Post Converter converts a Bangla word in the intermediate UNL expression into English for the final UNL expression is described step by step:
Step 1: The intermediate UNL expression is the input of the Post Converter.
Step 2: The Post Converter reads the full expression and finds the relation tpr or tpl. The relations tpr and tpl are used to identify the word that should be converted.
Step 3: The Post Converter collects the Bangla words to be converted, with the help of the two relations above. When a tpr relation is found, it collects the word after the blank space; for tpl it collects the first word.
Step 4: The converter converts the word by applying rules and finding the corresponding English syllable or word in the syllable dictionary.
Step 5: The converted word is placed in the final UNL expression.
Step 6: The final UNL expression is generated.

26. Dictionary Entries for Post Converter
Here we list some dictionary entries for the Post Converter. Table 1 shows the Bengali vowels and Table 2 shows the Bengali consonants, with the corresponding entries in the dictionary.
In the future we will define the full phonetics for Bengali-to-English conversion.
Bangla vowel: dictionary entries
[]{} "a" (TEMP)
[]{} "a" (TEMP)
[]{} "i" (TEMP)
[]{} "ei" (TEMP)
[]{} "oo" (TEMP)
[]{} "u" (TEMP)
[]{} "rri" (TEMP)
[]{} "a" (TEMP)
[]{} "oi" (TEMP)
[]{} "o" (TEMP)
[]{} "ou" (TEMP)

27. Dictionary Entries for Post Converter
Bangla consonant: dictionary entries
[]{} "k" (TEMP)
[]{} "kh" (TEMP)
[]{} "g" (TEMP)
[]{} "gh" (TEMP)
[]{} "ng" (TEMP)
[]{} "c" (TEMP)
[]{} "ch" (TEMP)
[]{} "j" (TEMP)
[]{} "jh" (TEMP)
[]{} "niya" (TEMP)
[]{} "t" (TEMP)
[]{} "th" (TEMP)
[]{} "d" (TEMP)
[]{} "dh" (TEMP)
[]{} "n" (TEMP)
[]{} "t" (TEMP)
[]{} "th" (TEMP)
[]{} "d" (TEMP)
[]{} "dh" (TEMP)
[]{} "n" (TEMP)
[]{} "p" (TEMP)
[]{} "f" (TEMP)
[]{} "b" (TEMP)
[]{} "v" (TEMP)
[]{} "mm" (TEMP)
[]{} "z" (TEMP)
[]{} "r" (TEMP)
[]{} "l" (TEMP)
[]{} "s" (TEMP)
[]{} "sh" (TEMP)
[]{} "s" (TEMP)
[]{} "h" (TEMP)
[]{} "r" (TEMP)
[]{} "rh" (TEMP)
[]{} "y" (TEMP)
[]{} "t" (TEMP)

28. Dictionary Entries for Post Converter
Bangla parts of word: dictionary entries
[]{} "ka" (TEMP)
[]{} "ki" (TEMP)
[]{} "kei" (TEMP)
[]{} "koo" (TEMP)
[]{} "ku" (TEMP)
[]{} "krriu" (TEMP)
[]{} "ka" (TEMP)
[]{} "koi" (TEMP)
[]{} "ko" (TEMP)
[]{} "kou" (TEMP)
[]{} "kra" (TEMP)
[]{} "kka" (TEMP)
Similar entries are needed in the word dictionary for every consonant.

29. Example...
Take an intermediate UNL expression:
{unl}
agt(read(icl>see>do,agt>person,obj>information).@entry.@present.@progress, : TEMP :05)
{/unl}
The Post Converter first reads the full sentence. When it finds the word with attribute TEMP, the converter collects this word and pushes it into the Post Converter. Then, applying rules, it converts it into the UNL word karim. EnConverter parses letter by letter:
= + -> Ka
= -> ri
-> m
That means is converted into Karim.
Thus the Post Converter converts all proper nouns from Bengali to English.

30.
Result Analysis
To convert any Bangla sentence we used the following files:
- Input file
- Rules file
- Dictionary
We used an Encoder (EnCoL33.exe); a screenshot of the enconversion is presented here. The screenshot shows the Encoder, which produces a UNL expression from Bangla or Bangla from a UNL expression.

31. Result Analysis
Figure: Encoder for Enconversion

32. Result Analysis
When the user clicks the convert button, it generates the corresponding UNL expression. The screen below shows this operation.
Figure: Enconversion process

33. Result Analysis
When the user clicks the convert button, it generates the corresponding UNL expression. The screen below shows this operation.
Figure: Enconversion process

34. Result Analysis
For example, , pronounced Karim Poritechhe, means Karim is reading. Here the subject Karim initiates an action, so an agent (agt) relation is made between the subject Karim and the verb read. The following dictionary entries are needed to convert the sentence:
[] {} karim (icl>name,iof>person,com>male)(N)
[] {} read (icl>see>do,agt>person,obj>information)(ROOT,CEND,^ALT)
[] {} (VI, CEND, CEG1, PRS, PRG, 3P)

35. Result Analysis

36. Result Analysis
But is the name of a person, so it does not need an entry in the dictionary. When the EnConverter, converting this sentence into a UNL expression, finds a word that is not in the dictionaries, it creates a temporary entry. So if is not in the dictionary, the temporary entry is:
[ ]{} "" (TEMP) ;.[]{}
Applied rules:
R{:::}{^PRON,^N,^VERB,^ROOT,^ADJ,^ADV,^ABY,^KBIV,^BIV:+N,+PROP::}(BLK)P10;
After applying these rules the word is considered a proper noun, and the temporary entry is converted to:
[]{}"" (N,PROP,TEMP) ;.[]{}
To convert this sentence into a UNL expression, morphological analysis is performed between (por) and (itechhe), and semantic analysis is performed between (karim) and (poritechhe).

37.
Result Analysis
Rule of morphological analysis: After applying some right shift rules, when the verb root (por) comes in the LAW and the verbal inflexion (itechhe) comes in the RAW, the following rule is applied to perform morphological analysis:
+{ROOT,CEND,^ALT,^VERB:+VERB,-ROOT,+@::}{VI,CEND:::}P10;
Rule of semantic analysis: After morphological analysis, when appears in the LAW and the verb (poritechhe) appears in the RAW, the following agent (agt) relation is made between (karim) and (poritechhe) to complete the semantic analysis:
>{N,SUBJ::agt:}{VERB,#AGT:+&@present,+&@progress::}P1;
where NOM denotes the attributes for nouns or pronouns.

38. Result Analysis
{unl}
agt(read(icl>see>do,agt>person,obj>information).@entry.@present.@progress, :TEMP: 05)
{/unl}
The intermediate UNL form shows that the word is the same as before conversion. When the sentence is converted into another language, the word is not understandable, because in the UNL expression the word is still a Bengali word. So this expression is passed to the Post Converter, which converts the word into English using the proposed simple phonetic method.

39. Result Analysis
The Post Converter takes the UNL expression and converts the proper noun from Bengali to English, karim.
= + -> Ka
-> ri
-> m
That means is converted into karim.

40. Result Analysis
After conversion the final UNL expression is:
{unl}
agt(read(icl>see>do,agt>person,obj>information).@entry.@present.@progress, karim 05)
{/unl}

41. Conclusion
In this paper we define our work in two parts: first, identifying a proper noun in a Bengali sentence, and second, converting this proper noun into UNL form. In the second part we use a converter, named the Post Converter, which uses a simple phonetic method to convert Bengali to English; we describe only the simple procedure, not the complete one. Our future plan in this regard is to give the complete rules for the Post Converter.
We will also work on the Bengali language and give the complete analysis rules for the enconversion program to convert Bangla to UNL, and the generation rules for the deconversion program to convert any UNL document to Bangla.
http://www.mecs-press.org/ijmecs/ijmecs-v6-n8/v6n8-1.html
http://www.academia.edu/7985833/Design_Analysis_Rules_to_Identify_Proper_Noun_from_Bengali_Sentence_for_Universal_Networking_Language

42. THANK YOU
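The simple phonetic method used by the Post Converter can be sketched as follows. This is only an illustration under stated assumptions, not the author's rule set: the small letter tables, the inherent-vowel handling and the function name `transliterate` are hypothetical, and the Bengali letters are supplied here because the slide transcript does not preserve them. The sketch reproduces the Karim example from the slides (Ka, ri, m).

```python
# Hypothetical letter tables; a real syllable dictionary would cover the full script.
CONSONANTS = {
    "\u0995": "k",  # ক
    "\u09b0": "r",  # র
    "\u09ae": "m",  # ম
}
VOWEL_SIGNS = {
    "\u09be": "a",  # া
    "\u09bf": "i",  # ি
}
INHERENT_VOWEL = "a"  # assumed sound of a consonant that carries no vowel sign


def transliterate(word: str) -> str:
    """Convert a Bengali word to Latin letters, one consonant (plus optional vowel sign) at a time."""
    out = []
    i = 0
    while i < len(word):
        ch = word[i]
        if ch not in CONSONANTS:
            raise ValueError(f"no dictionary entry for {ch!r}")
        out.append(CONSONANTS[ch])
        nxt = word[i + 1] if i + 1 < len(word) else None
        if nxt in VOWEL_SIGNS:
            # Consonant followed by an explicit vowel sign, e.g. "ri".
            out.append(VOWEL_SIGNS[nxt])
            i += 2
        elif nxt is None:
            # Word-final consonant: assume the inherent vowel is dropped, e.g. "m".
            i += 1
        else:
            # Bare consonant inside the word: assume the inherent vowel, e.g. "ka".
            out.append(INHERENT_VOWEL)
            i += 1
    return "".join(out)


karim = "\u0995\u09b0\u09bf\u09ae"  # করিম
print(transliterate(karim))  # prints "karim"
```

A table-driven design like this keeps the conversion rules in data, which matches the slides' plan to extend the syllable dictionary later; conjuncts, independent vowels and other cases are deliberately left out of the sketch.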