Joint Khmer Word Segmentation and Part-of-Speech Tagging Using Deep Learning

Rina Buoy† Nguonly Taing† Sokchea Kor ‡

†Techo Startup Center (TSC)   ‡Royal University of Phnom Penh (RUPP)

{rina.buoy,nguonly.taing}@[email protected]

Abstract

Khmer text is written from left to right with optional spaces. A space does not serve as a word boundary; instead, it is used for readability or other functional purposes. Word segmentation is a prior step for downstream tasks such as part-of-speech (POS) tagging, and thus the robustness of POS tagging highly depends on word segmentation. Conventional Khmer POS tagging is a two-stage process that begins with word segmentation, followed by the actual tagging of each word. In this work, a joint word segmentation and POS tagging approach using a single deep learning model is proposed so that word segmentation and POS tagging can be performed simultaneously. The proposed model was trained and tested using the publicly available Khmer POS dataset. The validation suggests that the performance of the joint model is on par with conventional two-stage POS tagging.

Keywords: Khmer, NLP, POS tagging, Deep Learning, LSTM, RNN

1 Introduction

1.1 Background

Khmer (KHM) is the official language of the Kingdom of Cambodia, and the Khmer script is used in the writing systems of Khmer and other minority languages such as Kuay, Tampuan, Jarai, Krung, Brao and Kravet. The Khmer language and writing system were hugely influenced by Pali and Sanskrit in early history [1].

Khmer script is believed to have originated from the Brahmi Pallava script. The Khmer writing system has undergone ten evolutions over more than 1400 years. In the early and mid-19th century, there was an effort to romanize the Khmer language; however, it was not successful. This incident led to Khmerization in the mid and late 19th century [1]. Khmer is classified as a low-resource language [2].

Khmer text is written from left to right with optional spaces. A space does not serve as a word boundary; instead, it is used for readability or other functional purposes. Therefore, word segmentation is a prior step in Khmer text processing tasks. Various efforts, including dictionary-based and statistical models, have been made to solve the Khmer segmentation problem [3][4].

Khmer has two phonological features that distinguish it from other Southeast Asian languages such as Thai and Burmese. Khmer is not a tonal language; therefore, a large set of vowel phonemes is available to compensate for the absence of tones. Khmer also has a large set of consonant clusters (C1C2 or C1C2C3) and allows complex initial consonant clusters at the beginning of a syllable [2].

1.2 POS Tagging

Part-of-speech tagging is a sequence labelling task in which each word is assigned one of a predefined set of tags according to its syntactic function. POS tagging is required for various downstream tasks such as spell checking, parsing, grammar induction, word sense disambiguation, and information retrieval [5].

There is no explicit word delimiter in the Khmer writing system. Automatic word segmentation is run to obtain segmented words, and POS tagging is performed afterwards. The performance of POS tagging is reliant on the results of segmentation in this two-stage approach [5].

For languages such as Khmer, Thai, and Burmese, which do not have an explicit word separator, the definition of a word is not a natural concept; therefore, segmentation and POS tagging cannot be separated, as both tasks unavoidably affect one another [2].

In this paper, we thus propose a joint word segmentation and POS tagging approach using a single deep learning network to remove this adverse dependency effect. The proposed model is a bidirectional long short-term memory (LSTM) recurrent network with a one-to-one architecture. The network takes a sequence of characters as input and outputs a sequence of POS tags. We use the publicly available Khmer POS dataset by [5] to train and validate the model.

2 Related Work

2.1 Word Segmentation

One of the early studies on Khmer word segmentation was done by [3]. The authors proposed a bidirectional maximum matching (BiMM) approach to maximize segmentation accuracy. BiMM performs maximum matching twice: forward and backward. The average accuracy of BiMM was reported to be 98.13%. BiMM was, however, unable to handle out-of-vocabulary (OOV) words or take context into account.

Another word segmentation approach, proposed by [6][7], used conditional random fields (CRF). The feature template was defined using a trigram model and 10 tags for each input character. 5,000 sentences were manually segmented and used to train a first CRF model, which was then used to segment more sentences. Additional manual corrections were needed to build a training corpus of 97,340 sentences; 12,468 sentences were used as a test set. Two CRF models were trained: the 2-tag model predicted only word boundaries, while the 5-tag model predicted word boundaries and also identified compound words. Both models obtained the same F1 score of 0.985.

[8] proposed a deep learning approach for the Khmer word segmentation task. Two long short-term memory networks were studied. The character-level network took a sequence of characters as input, while the character-cluster network took, instead, a sequence of Khmer character clusters (KCC).

2.2 Khmer POS Tagging

Khmer is a low-resource language with limited natural language processing (NLP) research. One of the earlier studies on POS tagging used a rule-based transformation approach [9]. The same authors later introduced a hybrid method combining trigram models and the rule-based transformation approach. 27 tags were defined. The dataset was manually annotated and contained about 32,000 words. For known words, the hybrid model achieved up to 95.55% and 94.15% accuracy on the training and test sets, respectively. For unknown words, 90.60% and 69.84% were obtained on the training and test sets.

Another Khmer POS tagging study was done by [10] using a conditional random field model. The authors reused the tag definitions in [9] and built a training corpus of 41,058 words. The authors experimented with various feature templates, including morphemes, word shapes and named entities, and the best template gave an accuracy of 96.38%.

Based on the Choun Nat dictionary, [5] defined 24 tags, some of which were added with the word disambiguation task in mind. The authors used the automatic word segmentation tool by [6] to segment 12,000 sentences, along with some manual corrections. Various machine learning algorithms were used to train POS taggers, including Hidden Markov Models (HMM), Maximum Entropy (ME), Support Vector Machines (SVM), Conditional Random Fields (CRF), Ripple Down Rules (RDR), and the Two Hours of Annotation approach (a combination of HMM and a Maximum Entropy Markov Model). The RDR approach outperformed the rest, achieving an accuracy of 95.33% on the test set, while the CRF, HMM and SVM approaches achieved comparable results.


2.3 Joint Segmentation and POS Tagging

For most Indo-European languages, POS tagging can be done after segmentation since spaces are used to separate words in their writing systems. Most East and Southeast Asian languages, on the contrary, do not use any explicit word separator, and the definition of a word is not well defined. Segmentation and POS tagging, therefore, cannot be separated. The authors of [2] suggested applying joint segmentation and POS tagging to low-resource languages that share similar linguistic features, such as Khmer and Burmese.

3 Modified POS Tag Set

[5] proposed a comprehensive set of 24 POS tags derived from the Choun Nat dictionary. This tag set is shown in Table 1. In this work, we propose the following revisions to the tag set defined in [5]:

• Grouping the measure (M) tag under the noun (NN) tag, since the syntactic role of the measure tag is the same as that of the noun tag. For example, some words belonging to the measure tag are កបល (head), សរមាប (set), and អងគ (person).

• Grouping the relative pronoun (RPN) tag under the pronoun tag, since the relative pronoun tag has only one word (ែដល).

• Grouping the currency tag (CUR), double tag (ៗ - DBL), et cetera tag (។ល។ - ETC), and end-of-sentence tag (។ - KAN) under the symbol (SYM) tag.

• Grouping the interjection (UH) tag under the particle (PA) tag.

• Grouping the adjective verb (VB_JJ) and complement verb (V_COM) tags under the verb (VB) tag.

After applying the above revisions, the resulting tag set consists of 15 tags, which are shown in Table 2.
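The grouping rules above can be sketched as a simple lookup table. The mapping below is a Python sketch derived from the bullet list (tags not listed map to themselves); it is an illustration, not part of the released dataset.

```python
# Sketch of the tag-grouping revisions: original 24-tag set of [5]
# mapped onto the revised 15-tag set proposed in this work.
REVISION_MAP = {
    "M": "NN",                                   # measure -> noun
    "RPN": "PRO",                                # relative pronoun -> pronoun
    "CUR": "SYM", "DBL": "SYM",                  # currency / double sign -> symbol
    "ETC": "SYM", "KAN": "SYM",                  # Khmer signs -> symbol
    "UH": "PA",                                  # interjection -> particle
    "VB_JJ": "VB", "V_COM": "VB",                # adjective/complement verb -> verb
}

def revise_tag(tag: str) -> str:
    """Map an original khPOS tag to its revised tag (identity if unlisted)."""
    return REVISION_MAP.get(tag, tag)
```

Applying `revise_tag` to all 24 original tags yields the 15 revised tags of Table 2.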

The descriptions of the revised POS tagsare as follows:

1. Abbreviation (AB): In Khmer writing, an abbreviation can be written with or without dots. Without an explicit dot, there is ambiguity between a word and an abbreviation. For example, គម or គ.ម. (kilometer).

2. Adjective (JJ): An adjective is a word used to describe a noun and is generally placed after the noun, except for loanwords from Pali or Sanskrit [5]. Some common Khmer adjectives are, for example, ស (white), លអ (good), តច (small), and ធ (big). ឧតតម and មហា are examples of Pali/Sanskrit loanwords.

3. Adverb (RB): An adverb is a word used to describe a verb, an adjective or another adverb [5][11]. For example, some words belonging to the adverb tag are េពក (very), ណាស (very), េហើយ (already), and េទើប (just).

4. Auxiliary Verb (AUX): Only three words are tagged as auxiliary verbs, and their syntactic role is to indicate tense [5]. បាន or មាន indicates past tense. កពង indicates progressive tense. នង indicates future tense.

5. Cardinal Number (CD): A cardinal number is used to indicate quantity [5]. Some examples of cardinal numbers are ១ (one), ប (three), បន (four), and លាន (million).

6. Conjunction (CC): A conjunction is a word used to connect words, phrases or clauses [5][11]. For example, some words belonging to the conjunction tag are េបើ (if), របសនេបើ (if), ពេរពាះ (because), េរពាះ (because), and ពេនាះេសាត (nevertheless).

7. Determiner Pronoun (DT): A determiner is a word used to indicate the location or uncertainty of a noun. Determiners are equivalent to the English words this, that, those, these, all, every, each, and some. In Khmer grammar, a determiner is tagged as either a pronoun or an adjective [11]. However, a determiner pronoun tag is used in [5] and in this work. For example, some words


Table 1. The POS Tag Set Proposed by [5]

No. Tag  Description          No. Tag    Description
1   AB   Abbreviation         13  NN     Noun
2   AUX  Auxiliary Verb       14  PN     Proper Noun
3   CC   Conjunction          15  PA     Particle
4   CD   Cardinal Number      16  PRO    Pronoun
5   CUR  Currency             17  QT     Question Word
6   DBL  Double Sign          18  RB     Adverb
7   DT   Determiner Pronoun   19  RPN    Relative Pronoun
8   ETC  Khmer Sign           20  SYM    Symbol/Sign
9   IN   Preposition          21  UH     Interjection
10  JJ   Adjective            22  VB     Verb
11  KAN  Khmer Sign           23  VB_JJ  Adjective Verb
12  M    Measure              24  V_COM  Verb Complement

Table 2. The Revised POS Tag Set Proposed in This Work

No. Tag  Description          No. Tag  Description
1   AB   Abbreviation         9   NN   Noun
2   AUX  Auxiliary Verb       10  PN   Proper Noun
3   CC   Conjunction          11  PA   Particle
4   CD   Cardinal Number      12  PRO  Pronoun
5   DT   Determiner Pronoun   13  QT   Question Word
6   IN   Preposition          14  RB   Adverb
7   JJ   Adjective            15  SYM  Symbol/Sign
8   VB   Verb

belonging to the determiner pronoun tag are េនះ (this), េនាះ (that), សពវ (every), ទាងេនះ (these), ������� (those), and ខលះ (some).

8. Pronoun (PRO): A pronoun is a word used to refer to a noun or noun phrase that was already mentioned [5][11]. In this work, the pronoun tag is used to tag both personal pronouns and relative pronouns. In Khmer grammar, there is only one relative pronoun, ែដល (that, which, where, who).

9. Preposition (IN): A preposition is a word used with a noun, noun phrase, verb or verb phrase to indicate time, place, location, possession and so on [5][11]. For example, some words belonging to the preposition tag are េនេលើ (above), កាលព (from), តាម (by), and អព (about).

10. Noun (NN): A noun is a word used to identify a person, animal, tree, place, or object in general [5][11]. For example, some words belonging to the noun tag are សសស (student), េគា (cow), ត (table), and កសករ (farmer).

11. Proper Noun (PN): A proper noun is a word used to identify the name of a particular person, animal, tree, place, or location [5][11]. For example, some words belonging to the proper noun tag are សខា (Sokha - a person's name), ភនេពញ (Phnom Penh), កមពជា (Cambodia), and សធអន (CTN).

12. Question Word (QT): Some examples of question words are េតើ (what) and ដចេមតច (how).

13. Verb (VB): A verb is a word used to describe an action, state, or condition [5][11]. For example, some words belonging to the verb tag are េដើរ (to walk), ជា (to be), េមើល (to watch), and េ�សក (to be thirsty). [5] invented two more verb tags: the adjective verb (VB_JJ) and the complement verb (V_COM). A VB_JJ was used to tag any verb behaving like an adjective in large compound words such as មាសនេបាកេខាអាវ (washing machine), កបតចតបែនល (knife) and so on. A V_COM, on the other hand, was used to tag any verb in verb phrases or collocations such as េរៀនេចះ (to learn), របលងជាប (to pass), and so on. Both VB_JJ and V_COM are dropped in this work because identifying VB_JJ and V_COM should be done via a compound and collocation identification task [12]. Certain Khmer compounds are loose and have the same structure as a subject-verb-object sentence [13][2]. This is illustrated in Figure 1.

14. Particle (PA): There is no clear concept of a particle in Khmer grammar. [5] identified three groups of particles, namely hesitation, response and final. Some examples of particles are េអើ (hesitation), េអើ (response), សន (final), and ចះ (final).

15. Symbol (SYM): Various symbols in writing are grouped under the SYM tag. The SYM tag includes currency signs, Khmer signs (ៗ, ។, ។ល។), various borrowed signs (+, -, ?), and so on.

4 POS Tagging Methodology

The application of deep learning to the Khmer word segmentation task was first proposed by [8]. In this work, we extend [8] by combining the word segmentation and POS tagging tasks in a single deep learning model. The proposed model is a variant of recurrent neural networks. The details are explained below.

4.1 Recurrent Neural Network

A recurrent neural network (RNN) is a type of neural network with a cycle in its connections. That means the value of an RNN cell depends on both its inputs and its previous outputs. Elman networks, or simple recurrent networks, have proven to be very effective in NLP tasks [14]. An illustration of a simple recurrent network is given in Figure 2.

The hidden vector h_t depends on both the input x_t and the previous hidden vector h_{t-1}:

h_t = g(U h_{t-1} + W x_t)   (1)

Where:

• U and W are trainable weight matrices. They are shared by all time steps in the sequence.

The fact that the computation of h_t requires h_{t-1} makes the RNN an ideal candidate for sequence tagging tasks such as POS tagging.
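A minimal sketch of the recurrence in Eq. (1), assuming g = tanh; the dimensions and random weights are illustrative only.

```python
import numpy as np

# One Elman RNN step: h_t = tanh(U h_{t-1} + W x_t), Eq. (1).
def rnn_step(h_prev, x_t, U, W):
    return np.tanh(U @ h_prev + W @ x_t)

rng = np.random.default_rng(0)
hidden, inp = 4, 3                        # illustrative sizes
U = rng.normal(size=(hidden, hidden))
W = rng.normal(size=(hidden, inp))

h = np.zeros(hidden)                      # initial hidden vector
for x_t in rng.normal(size=(5, inp)):     # unroll over a 5-step sequence
    h = rnn_step(h, x_t, U, W)            # U and W are shared across time
```

Note that the same U and W are applied at every time step, which is the weight sharing described above.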

4.2 Recurrent Neural Network for Sequence Labelling

Various forms of RNN architectures are given in Figure 3. The choice of an RNN architecture depends on the application of interest. A POS tagging task is a sequence labelling task, for which the one-to-one architecture is suitable. A possible RNN model for POS tagging is given in Figure 4. Here are the forward steps taken by an RNN model:

1. A sequence of inputs is encoded or embedded with an embedding layer.

2. Hidden vectors are computed by unrolling the computational graph through time.

3. At each time step, the RNN cell outputs an output vector.

4. A Softmax activation is applied to the output vectors.
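The four forward steps above can be sketched with untrained, randomly initialized weights; all names and dimensions here are illustrative assumptions, not the trained model.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())       # shift for numerical stability
    return e / e.sum()

rng = np.random.default_rng(1)
vocab, emb, hidden, n_tags = 10, 8, 6, 5   # toy sizes
E = rng.normal(size=(vocab, emb))          # step 1: embedding table
U = rng.normal(size=(hidden, hidden))
W = rng.normal(size=(hidden, emb))
V = rng.normal(size=(n_tags, hidden))      # output projection

ids = [3, 1, 4, 1]                         # a toy input sequence of symbol ids
h = np.zeros(hidden)
probs = []
for i in ids:
    h = np.tanh(U @ h + W @ E[i])          # step 2: unroll through time
    probs.append(softmax(V @ h))           # steps 3-4: output vector + softmax
```

Each element of `probs` is a distribution over the tag set, one per input position.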

4.3 Stacking

One or more RNN layers can be stacked on top of each other to form a deep RNN model. An illustration of a stacked RNN model is shown in Figure 4. In a stacked RNN model, an input sequence is fed to the first RNN layer to produce h^1_t; h^1_t is then fed to the next RNN layer, and so on, until the output layer.

Stacking is used to learn representations at different levels across layers. The optimal number of stacked layers depends on the application of interest and is a matter of hyper-parameter tuning [14].


Figure 1. Similar Structure of a Sentence and Compound Noun

Figure 2. A simple RNN [14]

Figure 3. Various RNN architectures [15]


Figure 4. An example RNN model for the POS tagging task [14]

4.4 Bidirectionality

In forward mode (from left to right), h^f_t represents what the model has processed from the first input in the sequence up to the input at time t.

h^f_t = RNN_forward(x_{1:t})   (2)

Where:

• h^f_t represents the forward hidden vector after the network sees the input sequence up to time step t.

• x_{1:t} are the inputs from time step 1 up to t.

In certain applications such as sequence labelling, a model can have access to the entire input sequence. It is then possible to process an input sequence in both directions, forward and backward [14]. h^b_t, the hidden vector up to time t in reverse order, can be expressed as:

h^b_t = RNN_backward(x_{t:n})   (3)

Where:

• h^b_t represents the backward hidden vector after the network sees the input sequence from time step n back to t.

• x_{t:n} are the inputs from time step n down to t, i.e., in reverse order.

An RNN model which processes the sequence in both directions is known as a bidirectional RNN (bi-RNN). h^f_t and h^b_t can be averaged or concatenated to form h_t.
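A toy sketch of bidirectionality: the same recurrence is run forward and backward over the sequence, and the two hidden vectors at each step are concatenated as described above (weights and dimensions are illustrative assumptions).

```python
import numpy as np

# Run a toy tanh-RNN over a list of input vectors, returning all hidden states.
def run(xs, U, W):
    h, hs = np.zeros(U.shape[0]), []
    for x in xs:
        h = np.tanh(U @ h + W @ x)
        hs.append(h)
    return hs

rng = np.random.default_rng(2)
hidden, inp, T = 3, 2, 5
U = rng.normal(size=(hidden, hidden))
W = rng.normal(size=(hidden, inp))
xs = list(rng.normal(size=(T, inp)))

h_fwd = run(xs, U, W)                       # Eq. (2): left to right
h_bwd = run(xs[::-1], U, W)[::-1]           # Eq. (3): right to left, re-aligned to t
h_cat = [np.concatenate([f, b]) for f, b in zip(h_fwd, h_bwd)]
```

Concatenation doubles the hidden dimension; averaging would keep it unchanged.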

4.5 Long Short-Term Memory (LSTM)

When processing a long sequence, an RNN cell has difficulty carrying forward critical information for two reasons [14]:

• The weight matrices need to provide information for the current output as well as future outputs.

• Back-propagation through time suffers from the vanishing gradient problem due to repeated multiplications along a long sequence.

Long Short-Term Memory (LSTM) networks were devised to address the above issues by introducing sophisticated gate mechanisms and an additional context vector to control the flow of information into and out of the units [14]. Each gate in an LSTM network consists of:


1. A feed-forward layer

2. A sigmoid activation

3. Element-wise multiplication with the gated layer

The choice of a sigmoid function is to gate information flow, as it tends to squash its output to either zero or one. The combined effect of a sigmoid activation and element-wise multiplication is the same as binary masking [14].

There are three gates in an LSTM cell. The details are as follows:

• Forget Gate: As the name implies, the objective of a forget gate is to remove irrelevant information from the context vector. The equations of a forget gate are given below:

f_t = σ(U_f h_{t-1} + W_f x_t)   (4)

k_t = c_{t-1} ⊙ f_t   (5)

Where:

– σ is a sigmoid activation.
– U_f and W_f are weight matrices.
– f_t and k_t are vectors at time t.
– c_{t-1} is the previous context vector.
– ⊙ is the element-wise multiplication operator.

• Add Gate: As the name implies, the objective of an add gate is to add relevant information to the context vector. The equations of an add gate are given below:

g_t = tanh(U_g h_{t-1} + W_g x_t)   (6)

Like a standard RNN cell, the above equation extracts information from the previous hidden vector and the current input.

i_t = σ(U_i h_{t-1} + W_i x_t)   (7)

j_t = g_t ⊙ i_t   (8)

The current context vector is updated as follows:

c_t = j_t + k_t   (9)

Where:

– U_i, W_i, U_g, and W_g are weight matrices.
– g_t, i_t, and j_t are vectors at time t.
– c_t is the current context vector at time t.

• Output Gate: As the name implies, the objective of an output gate is to determine what information is needed to update the current hidden vector at time t:

o_t = σ(U_o h_{t-1} + W_o x_t)   (10)

h_t = o_t ⊙ tanh(c_t)   (11)

Where:

– U_o and W_o are weight matrices.
– o_t is a vector at time t.
– h_t is the hidden vector at time t.
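Eqs. (4)-(11) can be transcribed directly into a single LSTM step. The weight shapes and random initialization below are illustrative assumptions, not the trained parameters.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One LSTM step following Eqs. (4)-(11); P holds the gate weight matrices.
def lstm_step(h_prev, c_prev, x, P):
    f = sigmoid(P["Uf"] @ h_prev + P["Wf"] @ x)    # (4) forget gate
    k = c_prev * f                                 # (5) keep relevant context
    g = np.tanh(P["Ug"] @ h_prev + P["Wg"] @ x)    # (6) candidate information
    i = sigmoid(P["Ui"] @ h_prev + P["Wi"] @ x)    # (7) add (input) gate
    j = g * i                                      # (8) gated new information
    c = j + k                                      # (9) new context vector
    o = sigmoid(P["Uo"] @ h_prev + P["Wo"] @ x)    # (10) output gate
    h = o * np.tanh(c)                             # (11) new hidden vector
    return h, c

rng = np.random.default_rng(3)
d, n = 4, 3                                        # toy hidden/input sizes
P = {f"U{g}": rng.normal(size=(d, d)) for g in "figo"}
P.update({f"W{g}": rng.normal(size=(d, n)) for g in "figo"})

h, c = np.zeros(d), np.zeros(d)
h, c = lstm_step(h, c, rng.normal(size=n), P)
```

The sigmoid-times-element-wise-product pattern in lines (5), (8), and (11) is the binary-masking effect discussed above.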

4.6 Bidirectional LSTM Network for Joint Word Segmentation and POS Tagging

In this section, we introduce a character-level bidirectional LSTM (Bi-LSTM) network for joint word segmentation and POS tagging. It is bidirectional since the model has access to the entire input sequence during the forward run.

The descriptions of the proposed Bi-LSTM network shown in Figure 5 are as follows:

1. Inputs: The network takes a sequence of characters as inputs.

2. Input Encoding: One-hot encoding is used to encode each input character.

3. LSTM Layers: The network processes the input sequence in both directions, forward and backward. The forward and backward hidden vectors in the final LSTM stack are concatenated to form a single hidden vector.

4. Feed-forward Layer: The concatenated hidden vector is fed into a feed-forward layer to produce an output vector. The size of the output vector is equal to the number of POS tags plus one, as an additional no-space (NS) tag is introduced. The NS tag is explained in the output representation section.


5. Softmax: Softmax activation is applied to the output vector to produce probabilistic outputs.

In the proposed models of [8], a sigmoid function is used to output the probability that a character or cluster is the starting character of a word.

4.7 Cross-Entropy Loss

Multi-class cross-entropy loss is used to train the proposed model. The loss is expressed as follows:

L(y, ŷ) = −Σ_{k=1}^{K} y_k log(ŷ_k)   (12)

Where:

• ŷ_k is the predicted probability of class k.

• y_k is 0 or 1, indicating whether class k is the correct classification.
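A minimal sketch of Eq. (12) with a one-hot target; the small epsilon guarding log(0) is an implementation detail added here, not part of the equation.

```python
import numpy as np

# Multi-class cross-entropy, Eq. (12): L(y, y_hat) = -sum_k y_k * log(y_hat_k),
# where y is a one-hot target and y_hat is a predicted distribution.
def cross_entropy(y, y_hat, eps=1e-12):
    return -np.sum(y * np.log(y_hat + eps))

y = np.array([0.0, 1.0, 0.0])              # correct class is k = 1
confident = np.array([0.1, 0.8, 0.1])
wrong = np.array([0.8, 0.1, 0.1])
# The loss is lower when probability mass sits on the correct class.
assert cross_entropy(y, confident) < cross_entropy(y, wrong)
```

In training, this loss is summed over every character position in the sequence.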

5 Experimental Setup

5.1 Dataset - Train/Test Set

The dataset used in this work is obtained from [5]. The dataset is available on GitHub (https://github.com/ye-kyaw-thu/khPOS) under the CC BY-NC-SA 4.0 license. The dataset consists of 12,000 sentences (25,626 words in total). The average number of words per sentence is 10.75. The word segmentation tool from [6] was used, and manual corrections were also performed. The original dataset has 24 POS tags. The most common tags are NN, PN, PRO, IN and VB.

We revised the dataset by grouping some tags as per the above discussion. The revised dataset has 15 tags in total - 9 tags fewer. The count and proportion of each tag are given in Table 3 in descending order.

The 12,000-sentence dataset was used as the training set. [5] provided a separate open test set for evaluating model performance.

5.2 Inputs and Labels Preparation

An example of a training sentence is given below:

ខញ/PRO �សលាញ/VB ែខមរ/PN ។/SYM
(I love Khmer)

Table 3. POS Tag Frequency in the Training Set

Tags    Frequency   Percentage
NN      32297       25.03%
PN      20084       15.57%
VB      18604       14.42%
PRO     13950       10.81%
IN      13446       10.42%
RB       6428        4.98%
SYM      5839        4.53%
JJ       4446        3.45%
DT       4311        3.34%
CD       3337        2.59%
CC       2788        2.16%
AUX      2466        1.91%
PA        885        0.69%
QT         79        0.06%
AB         69        0.05%
Total  129029      100.00%

• A space denotes a word boundary.

• The POS tag of a word comes right after the slash.

The corresponding input and target sequences of the above training sentence are illustrated in Figure 6. If a character is the starting character of a word, its label is the POS tag of that word, which also marks the beginning of a new word. Otherwise, the no-space (NS) tag is assigned.
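The labelling scheme can be sketched as follows; the Latin placeholder words stand in for Khmer script purely for readability, and the helper name is ours.

```python
# Build character-level labels from a segmented, tagged sentence:
# the first character of each word gets the word's POS tag;
# every other character gets the no-space (NS) tag.
def char_labels(tagged_words):
    """tagged_words: list of (word, tag) pairs from a segmented sentence."""
    chars, labels = [], []
    for word, tag in tagged_words:
        for pos, ch in enumerate(word):
            chars.append(ch)
            labels.append(tag if pos == 0 else "NS")
    return chars, labels

# Latin stand-ins for the Khmer example sentence above.
chars, labels = char_labels([("I", "PRO"), ("love", "VB"), ("Khmer", "PN")])
# labels -> ['PRO', 'VB', 'NS', 'NS', 'NS', 'PN', 'NS', 'NS', 'NS', 'NS']
```

Decoding reverses this: any non-NS prediction both starts a new word and supplies its POS tag, which is how one network output yields both segmentation and tags.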

5.3 Training Configuration

The network was implemented in the PyTorch framework and trained on Google Colab Pro. Training utilized mini-batches on GPU.

The following hyper-parameters were used:

• Number of LSTM stacks = 2

• Hidden dimension = 100

• Batch size = 128

• Optimizer = Adam with a learning rate of 0.001

• Epochs = 100

• Loss function = categorical cross-entropy loss


Figure 5. The proposed Bi-LSTM network for joint segmentation and POS tagging at character level

Figure 6. Input and Output Sequence Representation


Table 4. Accuracy of Word Segmentation

Metric     Training Set   Test Set
Accuracy   99.27%         97.11%

Each input character was encoded as a one-hot vector of 132 dimensions, which is the number of Khmer characters, including numerals and other signs.
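A minimal sketch of the one-hot input encoding; the character-to-index assignment is a placeholder, and only the 132-dimensional shape comes from the text.

```python
import numpy as np

N_SYMBOLS = 132  # Khmer characters, numerals, and other signs (from the text)

# Map a character's inventory index to a one-hot vector.
def one_hot(index: int, size: int = N_SYMBOLS) -> np.ndarray:
    v = np.zeros(size)
    v[index] = 1.0
    return v

x = one_hot(17)   # index 17 is an arbitrary placeholder for some character
```

In practice, a character-to-index dictionary built from the corpus would supply the index for each input character.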

6 Results and Evaluation

The accuracy of word segmentation is defined as:

Accuracy = Count_word^correct / Count_word^corpus   (13)

Where:

• Count_word^correct is the number of correctly segmented words.

• Count_word^corpus is the number of words in the corpus.

The accuracy for a given tag i is defined as:

Accuracy_i = Count_pos,i^correct / Count_pos,i^corpus   (14)

Where:

• Count_pos,i^correct is the number of correct predictions of POS tag i.

• Count_pos,i^corpus is the number of occurrences of POS tag i in the corpus.

The overall POS tagging accuracy is given by:

Accuracy = (Σ_{i=1}^{tag} Count_pos,i^correct) / (Σ_{i=1}^{tag} Count_pos,i^corpus)   (15)

Where:

• tag is the number of POS tags.
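Eqs. (14)-(15) can be sketched as one helper that computes both per-tag and overall accuracy from aligned gold and predicted tag sequences; the helper and its toy data are illustrative.

```python
from collections import Counter

# Per-tag accuracy (Eq. 14) and overall accuracy (Eq. 15) from aligned tags.
def tag_accuracy(gold, pred):
    """gold, pred: aligned lists of POS tags of equal length."""
    correct = Counter(g for g, p in zip(gold, pred) if g == p)
    total = Counter(gold)
    per_tag = {t: correct[t] / total[t] for t in total}          # Eq. (14)
    overall = sum(correct.values()) / sum(total.values())        # Eq. (15)
    return per_tag, overall

per_tag, overall = tag_accuracy(["NN", "VB", "NN", "PRO"],
                                ["NN", "VB", "VB", "PRO"])
# per_tag["NN"] == 0.5, overall == 0.75
```

Word segmentation accuracy (Eq. 13) has the same correct-over-corpus form, counted over words rather than tags.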

The segmentation and overall POS tagging accuracies are given in Tables 4 and 6, respectively, while the accuracy breakdown by tag is given in Table 5.

Table 5. POS Tag Accuracy Breakdown

Tags    Train      Test
NN      98.43%     93.75%
PN      98.94%     96.40%
VB      96.90%     89.12%
PRO     98.86%     97.39%
IN      97.49%     92.05%
RB      95.00%     87.23%
SYM    100.00%     99.74%
JJ      92.35%     79.89%
DT      99.71%     96.99%
CD      98.52%     96.82%
CC      96.06%     94.09%
AUX    100.00%     98.15%
PA      95.77%     83.33%
QT     100.00%    100.00%
AB     100.00%     83.33%

Table 6. Overall Accuracy of POS Tagging

Metric     Training Set   Test Set
Accuracy   98.14%         94.00%

7 Discussion

The trained model achieved segmentation and POS tagging accuracies of 97.11% and 94.00% on the test set, respectively. Compared with [5], our POS tagging accuracy was 1.33% lower on the open test set. In the work of [5], the overall error is composed of two components, segmentation error and POS tagging error, and is approximated by the equation below:

ϵ_t = ϵ_s + ϵ_p   (16)

Where:

• ϵ_t is the overall error.

• ϵ_s is the segmentation error.

• ϵ_p is the POS tagging error.

Since the segmentation error was not included in [5], the reported error was just the POS tagging error. [5] used the segmentation tool by [6], which has a reported error (ϵ_s) of 1.5%. The estimated overall error (ϵ_t) of [5] is thus about 6.17%, since the highest accuracy of [5] was reported to be 95.33% (ϵ_p = 4.67%).

Thus, the performance of the joint segmentation and POS tagging model, with an ϵ_t of 6.00%, is on par with the conventional two-stage POS tagging method.

8 Conclusion and Future Work

In this work, we proposed joint word segmentation and POS tagging using a deep learning approach. We presented a bidirectional LSTM network that takes inputs at the character level and outputs a sequence of POS tags. The overall accuracy of the proposed model is on par with that of the conventional two-stage Khmer POS tagging. We believe the available training dataset is limited in size, and a significantly larger dataset is required to train a more robust joint POS tagging model with greater generalization ability.

References

[1] Makara Sok. Phonological Principles and Automatic Phonemic and Phonetic Transcription of Khmer Words. PhD thesis, Payap University, 2016.

[2] Chenchen Ding, Masao Utiyama, and Eiichiro Sumita. NOVA: A feasible and flexible annotation system for joint tokenization and part-of-speech tagging. ACM Trans. Asian Low-Resour. Lang. Inf. Process., 18(2), December 2018.

[3] Narin Bi and Nguonly Taing. Khmer word segmentation based on bi-directional maximal matching for plain text and Microsoft Word. Signal and Information Processing Association Annual Summit and Conference (APSIPA), 2014.

[4] Chea Sok Huor, Top Rithy, Ros Pich Hemy, Vann Navy, Chin Chanthirith, and Chhoeun Tola. Word bigram vs orthographic syllable bigram in Khmer word. PAN Localization Team, 2007.

[5] Ye Kyaw Thu, Vichet Chea, and Yoshinori Sagisaka. Comparison of six POS tagging methods on 12K-sentence Khmer language POS tagged corpus. 1st Regional Conference on OCR and NLP for ASEAN Languages, 2017.

[6] Vichet Chea, Ye Kyaw Thu, Chenchen Ding, Masao Utiyama, Andrew Finch, and Eiichiro Sumita. Khmer word segmentation using conditional random fields. Khmer Natural Language Processing, 2015.

[7] Ye Kyaw Thu, Vichet Chea, Andrew Finch, Masao Utiyama, and Eiichiro Sumita. A large-scale study of statistical machine translation methods for Khmer language. In Proceedings of the 29th Pacific Asia Conference on Language, Information and Computation, pages 259-269, Shanghai, China, October 2015.

[8] Rina Buoy, Nguonly Taing, and Sokchea Kor. Khmer word segmentation using BiLSTM networks. 4th Regional Conference on OCR and NLP for ASEAN Languages, 2020.

[9] C. Nou and W. Kameyama. Khmer POS tagger: A transformation-based approach with hybrid unknown word handling. In International Conference on Semantic Computing (ICSC 2007), pages 482-492, 2007.

[10] Sokunsatya Sangvat and Charnyote Pluempitiwiriyawej. Khmer POS tagging using conditional random fields. Communications in Computer and Information Science, 2017.

[11] National Council of Khmer Language. Khmer Grammar Book. National Council of Khmer Language, 2018.

[12] Wirote Aroonmanakun. Thoughts on word and sentence segmentation in Thai. 2007.

[13] Sok Khin. Khmer Grammar. Royal Academy of Cambodia, 2007.

[14] Dan Jurafsky and James H. Martin. Speech and Language Processing. 3rd edition draft, 2020.

[15] Andrej Karpathy. The unreasonable effectiveness of recurrent neural networks.