
Transcript of الرسالة 15 1 2014


مقارنة بعض تقنيات تصنيف النصوص العربية باستخدام نموذج خليط

متعدد الحدود

Comparison Some of Arabic Text Classification

Techniques using a Multinomial Mixture Model

Prepared by:

Siham Abdalhady Hasan

Supervised by:

Prof. Ghassan Kanaan

This Thesis was submitted in Partial Fulfilment of the Requirements for the Master's Degree of Science in Computer Science, Faculty of Computer Sciences and Informatics

Amman Arab University

2013


Authorization

I, Siham Abdalhady Hasan, authorize Amman Arab University to provide copies of my thesis to libraries, institutions, or anyone requesting a copy.

Name: Siham Abdalhady Hasan

Signature:………………………

Date: 11/1/2013


Name: Siham Abdalhady Hasan.

Degree: Master of Computer Science.

Title of thesis in English:

Comparison Some of Arabic Text Classification

Techniques using a Multinomial Mixture Model

Title of thesis in Arabic:

مقارنة بعض تقنيات تصنيف النصوص العربية باستخدام نموذج خليط متعدد الحدود

Examining Committee Signature

Dr. Riyad F. Al-Shalabi

Dr. Ghassan Kanaan

Dr. Omar Dabas


Abbreviations

Abbreviation Description

IR Information Retrieval

TC Text Classification

ATC Arabic Text Classification

WWW World Wide Web

MMM Multinomial Mixture Model

KNN K-Nearest Neighbour

NB Naïve Bayes

SVM Support Vector Machine

D Document

C Class

HTML Hyper Text Markup Language

SGML Standard Generalized Markup Language

XML Extensible Markup Language

Re Recall

Pr Precision

FSS Feature Subset Selection

VMF Von Mises-Fisher

BPSO Binary Particle Swarm Optimisation

LDA Latent Dirichlet Allocation

MaP Macro Precision

MaR Macro Recall

TF Term Frequency

IDF Inverse Document Frequency


Table of Contents

1. Chapter One: Introduction .......................................................................................................... 1

1.1. Introduction ........................................................................................................................ 2

1.2. Arabic Language .................................................................................................................. 4

1.3. The Statement of the Problem............................................................................................ 7

1.4. Thesis Objective .................................................................................................................. 8

1.5. Summary ............................................................................................................................. 8

2. Chapter Two: Literature Review ................................................................................................. 9

2.1. Literature Review .............................................................................................................. 10

2.1.1. Text Classification ...................................................................................................... 10

2.1.2. Arabic Text Classification .......................................................................................... 17

2.2. Summary ........................................................................................................................... 25

3. Chapter Three: Methodology .................................................................................................... 26

3.1. Introduction ...................................................................................................................... 27

3.2. System Architecture .......................................................................................................... 27

3.2.1. Arabic Corpus ............................................................................................ 28

3.2.2. Pre-processing ........................................................................................................... 29

3.2.3. Classifiers ................................................................................................................... 33

3.2.4. Evaluation .................................................................................................................. 40

3.3. Summary ........................................................................................................................... 41

4. Chapter Four: Experiments and Evaluation .............................................................................. 42

4.1. Introduction ...................................................................................................................... 43

4.2. Data Set Preparation ......................................................................................................... 43

4.3. Performance measures: .................................................................................................... 44

4.4. Evaluation Results ............................................................................................................. 48

4.4.1. Naïve Bayes Algorithm Using (MMM). ...................................................................... 48

4.4.2. Comparison of MMM with Other Techniques and Discussion of Results ................. 51

4.5. Results of Naïve Bayes algorithm (MMM) with 5070 documents .................................... 54

4.6. Summary ........................................................................................................................... 58

5. Chapter Five: Conclusion and Future Work .............................................................................. 59


5.1. Conclusion: ........................................................................................................................ 60

5.2. Future Work: ..................................................................................................................... 60

Reference ...................................................................................................................................... 61


Table of Figures

Figure 3-1 Text Classification Architecture ........................................................................ 28

Figure 3-2 Pre-processing Steps ......................................................................................... 29

Figure 3-3 An Example for KNN ......................................................................................... 36

Figure 4-1 Result of the Naive Bayes – MMM Classification Algorithm ............................ 51

Figure 4-2 MaF1, MiF1 Comparison for Classifiers ............................................................ 52

Figure 4-3 MaP Comparison for Classifiers ........................................................................ 53

Figure 4-4 MaR Comparison for Classifiers ....................................................................... 53

Figure 4-5 Precision, Recall and F-Measure for the Three Classifiers ................................ 54

Figure 4-6 Result of the Naive Bayes Classification Algorithm .......................................... 57


List of Tables

Table 3-1 Strings Removed by Light Stemming ................................................................... 31

Table 4-1 Number of Documents for each Category ......................................................... 44

Table 4-2 Confusion Matrix for Performance Measures .................................................. 45

Table 4-3 The Global Contingency Table ........................................................................... 46

Table 4-4 Macro-Average ................................................................................................... 46

Table 4-5 Micro-Average .................................................................................................... 47

Table 4-6 Confusion Matrix Results for NB Using MMM Algorithm .................................. 49

Table 4-7 Confusion Matrix Results for NB Algorithm ...................................................... 50

Table 4-8 NB Using MMM Classifier Weighted Average for the Nine Categories ............. 51

Table 4-9 Classifier Comparison ......................................................................................... 52

Table 4-10 Categories and Their Distributions in the Corpus (5070 Documents) ............. 55

Table 4-11 Confusion Matrix Results for NB Algorithm in the Corpus (5070 Documents) ............... 55

Table 4-12 Confusion Matrix Results for NB Algorithm in the Corpus (5070 Documents) ................ 56

Table 4-13 NB Using MMM Classifier Weighted Average for the Six Categories in the Arabic Corpus (5070 Documents) ............................................................................................. 57


Acknowledgements

I would like to express my sincerest gratitude to my supervisor, Prof. Ghassan

Kanaan, who has been exceptionally patient and understanding with me during

my studies. Without his kind words of encouragement and advice this work

would not have been possible.

I am extremely grateful to all the staff who have assisted me in the Department of Computer Sciences and Informatics, especially Prof. Ala'a Al-Hamami.

Thanks also to all of my other colleagues in Computer Sciences and Informatics for making my time here an enjoyable experience.

I would like to thank the Libyan Embassy in Amman for taking care of me and supporting my study.

The support of my family and friends has been much appreciated, and most

importantly, I would like to thank my husband, Ali and my children, to whom I

am indebted for all of the moral and loving support they have given me during

this time.


Abstract

Text Classification (TC) assigns documents to one or more predefined categories based on their contents. This project focuses on the comparison of three automatic TC techniques applied to the Arabic language: Rocchio, K-Nearest Neighbour (KNN) and the Naïve Bayes (NB) classifier using a multinomial mixture model (MMM). In order to evaluate the mentioned techniques using the MMM, an Arabic TC corpus was used that consists of 1445 Arabic documents classified into nine categories: Computer, Economics, Education, Sport, Politics, Engineer, Medicine, Law, and Religion. The main goal of this project is to compare some automatic text classification techniques using a multinomial mixture model on the Arabic language. The classification effectiveness has been compared with the SVM model, which was applied in another project that used the same traditional classifiers and the same collection. Moreover, the experimental results are presented in terms of macro-averaged precision, macro-averaged recall and macro-averaged F1 measures. Furthermore, the results reveal that naive Bayes using the MMM works best for Arabic TC tasks and outperformed the k-NN and Rocchio classifiers.


1. Chapter One: Introduction


1.1. Introduction

With the rapid development of the Internet, a large amount of Arabic information has become available online; this motivates researchers to find tools that may help people classify this huge volume of Arabic information.

An information retrieval system is designed to analyse, process and store sources of information, and to retrieve those that match a particular user's requirements. In other words, in an IR system the similarity scores between a query and a set of documents are calculated, and the relevant documents are ranked based on their similarity scores. There are two main issues in IR systems. The first is that the characterization of the user's information need is not always clear, and it needs to be transformed in order to be understood by the IR system; this transformed need is known as a query (a short document containing few words) [1]. The second problem lies in the structure of the information, where there are no standards or rules that control this structure, especially on the World Wide Web (WWW), and each language has its own characteristics and semantics. In addition, users need to find excellent information which is suitable for their requirements. Furthermore, time has to be taken into account so that information is found quickly.

These issues point to a very important topic, which is Text Classification (TC). A text classification system classifies all documents into a fixed number of predefined categories based on their content. Moreover, text classification may be either single-label, where exactly one category must be assigned to each document, or multi-label, where one or more categories can be assigned to each document. Therefore, the main objective of using TC is to make the IR system's results better than they would be without TC [2]. These advantages have led to the development of automatic text and document classification systems, which are capable of automatically organizing and classifying documents [3].


The classification process can be done manually or automatically. It is interesting to note that manual categorization is considered a difficult and complex task, especially with huge amounts of information, because human experts classify the documents one by one. Furthermore, the time needed to complete this mission would be too long. On the other hand, with the speedy growth of online text documents, automatic text categorization (TC) has become an essential tool for handling text documents efficiently and effectively [4].

Text classification is the task of classifying a document under a predefined category. More formally, if di is a document of the entire set of documents D and {c1, c2, ..., cn} is the set of all categories, then text classification assigns one category cj to a document di. Given the increased amount of Arabic information on the Internet, classifying documents manually is not practical; automatic text classification has therefore become an essential task that saves the human effort required to perform manual text classification.
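As a small illustration of this definition (an added sketch, not part of the original system), single-label assignment can be viewed as picking the category whose scoring function ranks the document highest; the score function here is an assumed placeholder for any classifier's scoring rule:

def classify(d, categories, score):
    # Assign document d the single category c_j whose score(d, c) is highest.
    return max(categories, key=lambda c: score(d, c))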

Three essential stages in a text classification system are:

Document indexing: one of the most substantial issues in TC, which includes document representation and a term weighting scheme. The bag-of-words is the most common way to represent the content of a text. This approach is considered simple because it records only the frequency of each word in a document. Moreover, for all the predefined categories, the synonyms and prefix words for each category are found, which helps to assign a document to a category based on the synonym or prefix of a term. In addition, some term weighting schemes are described in detail in chapter 3.

Classifier learning: several machine learning algorithms have been applied to automatic text classification by supervised learning [5]. A supervised learning algorithm finds a representation or decision rule from an example set of labelled documents for each class [6]. This can be illustrated briefly by Naive Bayes (NB) [7-9], Support Vector Machines (SVM) [10-12], k-Nearest Neighbour (k-NN) [13, 14], Decision Trees (DT), Rocchio [5], Voting, etc.

Classifier evaluation: determining the effectiveness of each classifier according to the results it achieves. Standard evaluation measures such as Recall, Precision and the F1-measure have been used to evaluate the different classifiers; a small illustrative sketch follows.
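To make these measures concrete, the following minimal Python sketch (illustrative only; the counts are hypothetical) computes precision, recall and F1 for one category from its true-positive, false-positive and false-negative counts:

def evaluate(tp, fp, fn):
    # Precision: fraction of documents assigned to the category that truly belong to it.
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Recall: fraction of documents belonging to the category that were found.
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Example: 40 correct assignments, 10 wrong assignments, 5 missed documents.
print(evaluate(40, 10, 5))  # (0.8, 0.888..., 0.842...)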

1.2. Arabic Language

Applying text classification systems to the Arabic language is a challenging task because Arabic has a very complex morphology [15]. The Arabic alphabet consists of 28 letters:

أ ب ت ث ج ح خ د ذ ر ز س ش ص ض ط ظ ع غ ف ق ك ل م ن ه و ي

The characters (ا, و, ي) are called vowels and the rest of the letters are consonants. Arabic letters can be written in different forms depending on their position in the word (beginning, middle, or end). For example, the letter (ط) has several shapes: (طـ) if it appears at the beginning (طريق, which means Road in English); (ـطـ) if it appears in the middle (سطح, which means Surface in English); and (ـط) if it appears at the end (مطاط, which means Rubber in English). Furthermore, the Arabic language contains diacritics (التشكيل), which are placed above or below the letters. The diacritics (fathah, kasra, damma, sukun, double fathah, double kasra, double damma and shadda) are used to clarify the meaning of words [16]. On top of that, when diacritics are not clearly written, an Arabic word can have several meanings; this ambiguity negatively affects text classification. To avoid these problems, pre-processing can be applied to the Arabic language.

The Arabic language has a more complex morphology than the English language. Arabic is written from right to left. Arabic words have two genders, feminine and masculine; three numbers, singular, dual, and plural; and three grammatical cases, nominative, accusative, and genitive. A noun takes the nominative case when it is the subject, the accusative when it is the object of a verb, and the genitive when it is the object of a preposition. In addition, Arabic words are divided into three parts of speech: noun, verb, and particle. Noun and verb stems are derived from a few thousand roots by infixing, creating, for example, words like حاسوب (computer), يحسب (he calculates), and نحسب (we calculate) from the root حسب [17].

A noun is a name or a word that describes a person, thing, or an idea.

Arabic verbs, similar to English verbs, are classified into perfect and imperfect. The perfect tense denotes completed actions, while the imperfect denotes incomplete actions. The imperfect tense has four moods: indicative, subjunctive, jussive, and imperative [18].

Arabic particles include prepositions, adverbs, conjunctions, interrogative

particles, exceptions, and interjections.

Most Arabic words are derived from the pattern (فعل); all words following the same pattern have common properties and states. For example, the pattern (فاعل) indicates the subject of the verb, while the pattern (مفعول) represents the object of the verb.

An Arabic adjective can also have many variants. When an adjective modifies a noun in a phrase, the adjective agrees with the noun in gender, number, case, and definiteness. An adjective has a masculine singular form such as جديد (new), a feminine singular form such as جديدة (new), a masculine plural form such as جدد (new), and a feminine plural form such as جديدات (new) [19].

In addition to the different forms of the Arabic word that result from the derivational process, most connectors, conjunctions, prepositions, pronouns, and possessive forms are attached to the Arabic surface form as prefixes and suffixes. For instance, definite nouns are formed by attaching the article (ال) (the) to the immediate front of the noun. The conjunction (و) (and) is often attached to the following word. The letters (ل, ب, ف, ك) can be added to the front of a word as prepositions. The suffix (ة) is attached to represent the feminine gender of a word. Also, some suffixes are added to represent possessive pronouns: ها for (Her), ي for (My), and هم, هن for (Their) [19, 20].

In addition, Arabic has two kinds of plurals: sound plurals and broken plurals. Sound plurals are formed by adding plural suffixes to singular nouns. The plural suffix is ات for feminine nouns in all three grammatical cases, ون for masculine nouns in the nominative case, and ين for masculine nouns in the genitive and accusative cases. The formation of broken plurals is more complex and often irregular, and is therefore difficult to predict; furthermore, broken plurals are very common in Arabic. For example, the plural form of the noun طفل (child) is أطفال (children), which is formed by attaching the prefix أ and inserting the infix ا. The plural form of the noun كتاب (book) is كتب (books), which is formed by deleting the infix ا. The plural form of إمراة (woman) is نساء (women); here the plural form is completely different from the singular form [19].


1.3. The Statement of the Problem

The IR system has been widely used to assist users with the discovery of useful information from the Internet. Current IR systems are based on the similarity and term frequency between the query (the user's requirement) and the information available on the Internet. However, IR ignores important semantic relationships between them, and that ignorance makes the search operation slow and wastes a lot of time. To overcome this problem, text classification (categorization) is a solution.

Text classification techniques have been applied only a little to the Arabic language compared to other languages. Unfortunately, there is no perfect technique to classify text; thus researchers have been encouraged to develop TC techniques using many different models and methods.

In this project the Multinomial Mixture Model (MMM) has been suggested and applied to classify Arabic documents. In addition, this experiment will be compared with other classifiers, in order to clarify which model performs better than the others.


1.4. Thesis Objective

Arabic text can be considered completely different from English text and has a complex morphology. In this thesis, the Multinomial Mixture Model (MMM) has been recommended and applied to classify Arabic documents. Moreover, three different techniques are examined on Arabic text: the Rocchio algorithm, traditional k-NN and naïve Bayes.

The text classification system with these techniques has been evaluated using the standard measures: recall, precision and F-measure. Moreover, the effectiveness of each classifier will be decided according to the results achieved. Finally, the results of the MMM have been compared with the other two algorithms to determine the best information retrieval system for the Arabic language.

1.5. Summary

This chapter gives a short introduction to information retrieval (IR) systems. It also focuses on text categorization (TC) and describes the most important tasks of a text categorization system. After the introduction, the Arabic language has been described briefly. Moreover, the thesis's problem is presented. Finally, the multinomial mixture model has been adopted as the thesis objective.


2. Chapter Two: Literature Review


2.1. Literature Review

Text classification is defined as assigning new documents to a set of pre-defined categories based on the classification patterns [21, 22]. In recent years, there has been an increasing amount of literature on the TC topic. Moreover, researchers have shown an increased interest in continuing this research and developing it according to previous work.

2.1.1. Text Classification

Text classification techniques have been investigated and used in many application areas. Moreover, many researchers have studied text classification using different techniques.

2.1.1.1. Classification Based on Supervised Learning

The target of classification methods is to assign class labels to unlabelled text documents from a fixed number of known categories. Each document can belong to multiple categories, exactly one, or no category at all.

Supervised machine learning methods prescribe the input and output format. The input to these methods is a set of objects (training data), and the output is the classes to which these objects belong.

The key advantage of supervised learning methods over unsupervised methods is that, by having a clear knowledge of the classes the different objects belong to, these algorithms can perform an effective feature selection if that leads to better prediction accuracy [23].

Sentiment classification can obviously be formulated as a supervised learning problem with two class labels (positive and negative). Training and testing data used in existing research are mostly product reviews, which is not surprising given the above assumption. Since each review at a typical review site already has a reviewer-assigned rating (e.g., 1-5 stars), training and testing data are readily available. Typically, a review with 4-5 stars is considered a positive review (thumbs-up), and a review with 1-2 stars is considered a negative review (thumbs-down).

Sentiment classification is similar to, but also different from, classic topic-based text classification, which classifies documents into predefined topic classes (politics, sciences, sports, etc.). In topic-based classification, topic-related words are important. In sentiment classification, topic-related words are unimportant. Instead, sentiment or opinion words that indicate positive or negative opinions are important (e.g., great, excellent, amazing, horrible, bad, worst, etc.).

Existing supervised learning methods, such as Naïve Bayes and Support Vector Machines (SVM), can be readily applied to sentiment classification [24]. This approach can be used to classify movie reviews into two classes (positive and negative). It was shown that using unigrams (a bag of individual words) as features in classification performed well with both Naïve Bayes and SVM. Neutral reviews were not used in this work, making the problem easier. The features used are data attributes used in machine learning, not the object features referred to in the previous section.

Subsequent research used many more kinds of features and techniques in learning. As in most machine learning applications, the main task of sentiment classification is to find a suitable set of features. Some example features used in research, and possibly in practice, are mentioned in [25]:

Terms and their frequency: These features are individual words or word n-grams and their frequency counts. Sometimes word positions may also be considered. The TF-IDF weighting scheme from information retrieval may be applied too. These features are commonly used in traditional topic-based text classification, and they have been shown to be quite effective in sentiment classification as well.

Part-of-Speech Tags: In much early research, it was found that adjectives are important indicators of subjectivity and opinions. Therefore, adjectives have been treated as special features.

Opinion Words and Phrases: Opinion words are words that are commonly

used to express positive or negative sentiments. For example, beautiful,

wonderful, good, and amazing are positive opinion words, and bad, poor, and

terrible are negative opinion words. Although many opinion words are

adjectives and adverbs, nouns (rubbish, junk, crap, etc.) and verbs (hate, and

like) can also indicate opinions. In addition to opinion words, there are also

opinion phrases and idioms (cost someone an arm and a leg). Opinion words

and phrases are helpful to sentiment analysis.

Syntactic Dependency: Word-dependency-based features generated from parsing or dependency trees have also been tried by several researchers.

Negation: Clearly negation words are important since their appearances often

change the opinion orientation. For example, the sentence “I don’t like this

camera” is negative. Negation words must be handled with care because not all

occurrences of such words mean negation. For example, “not” in “not only but

also” does not change the orientation direction.

Research has also predicted rating scores [24]. In this case, the problem is formulated as a regression problem, since the rating scores are ordinal. Another investigated research direction is transfer learning or domain adaptation. As has been shown, sentiment classification is highly sensitive to the domain from which the training data are extracted. A classifier trained using opinionated texts from one domain often performs poorly when it is applied or tested on opinionated texts from another domain. The reason is that words and even language constructs used in different domains for expressing opinions can be substantially different. Sometimes, the same word in one domain means positive, but in another domain means negative [26]. For example, the adjective unpredictable may have a negative orientation in a car review ("unpredictable steering"), but it could have a positive orientation in a movie review ("unpredictable plot"). Therefore, domain adaptation is needed. Existing research has used labelled data from one domain, unlabelled data from the target domain, and general opinion words as features for adaptation [27].

2.1.1.2. Classification Based on Unsupervised Learning

Opinion words and phrases are the dominating indicators for sentiment classification. Therefore, using unsupervised learning based on such words and phrases is quite natural. The method used in [26] performs classification based on some fixed syntactic phrases that are likely to be used to express opinions. The algorithm consists of three steps:

Step 1: It extracts phrases containing adjectives or adverbs, because research has shown that adjectives and adverbs are good indicators of subjectivity and opinions. Although an isolated adjective may indicate subjectivity, there may be insufficient context to determine its Opinion Orientation (OO). Thus, the algorithm extracts two consecutive words, where one member of the pair is an adjective/adverb and the other is a context word.

For example: In the sentence, “This camera produces beautiful pictures”,

“beautiful pictures” will be extracted as it satisfies the first pattern.


Step 2: It estimates the orientation of the extracted phrases using the Pointwise Mutual Information (PMI) measure given in equations (1.1), (1.2) and (1.3) [26].

PMI(term1, term2) = log2 [ P(term1 & term2) / (P(term1) · P(term2)) ]    (1.1)

P(term1 & term2) is the co-occurrence probability of term1 and term2, and P(term1) · P(term2) is the probability that the two terms co-occur if they are statistically independent. The ratio between P(term1 & term2) and P(term1) · P(term2) is a measure of the degree of statistical dependence between them. The log of this ratio is the amount of information that we acquire about the presence of one of the words when the other is observed.

The Opinion Orientation (OO) of a phrase is computed based on its association

with the positive reference word “excellent”, and its association with the

negative reference word “poor”:

SO(phrase) = PMI(phrase, "excellent") − PMI(phrase, "poor")    (1.2)

The probabilities are calculated by issuing queries to a search engine and collecting the number of hits. For each search query, a search engine usually gives the number of documents relevant to the query, which is the number of hits. Thus, by searching for the two terms together and separately, we can estimate the probabilities in equation (1.1). [26] used the AltaVista search engine because it has a NEAR operator, which constrains the search to documents that contain the words within ten words of one another, in either order. Let hits(query) be the number of hits returned.

Equation (1.2) can be rewritten as:

SO(phrase) = log2 [ (hits(phrase NEAR "excellent") · hits("poor")) / (hits(phrase NEAR "poor") · hits("excellent")) ]    (1.3)

Step 3: Given a review, the algorithm computes the average OO of all phrases in the review, and classifies the review as recommended if the average OO is positive, and as not recommended otherwise.
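To make the three steps concrete, the sketch below implements equations (1.1)-(1.3) in Python. It is an illustration under the assumptions stated here: the hits function is a placeholder standing in for the search-engine hit counts described above, not an API from the cited work.

import math

def so(phrase, hits):
    # Equation (1.3): hits() is assumed to map a query string to a hit count.
    numerator = hits(phrase + ' NEAR excellent') * hits('poor')
    denominator = hits(phrase + ' NEAR poor') * hits('excellent')
    if numerator == 0 or denominator == 0:
        return 0.0  # no evidence either way
    return math.log2(numerator / denominator)

def classify_review(phrases, hits):
    # Step 3: average the opinion orientation of all extracted phrases.
    average = sum(so(p, hits) for p in phrases) / len(phrases)
    return 'recommended' if average > 0 else 'not recommended'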

A TC system for the Arabic language cannot be considered an easy task compared with English, because the Arabic language has a very complex morphology [15]. Moreover, the evaluation stage is considered the most important stage in any IR system: it is used to determine the efficiency of the system and helps to establish which system is better than another.

One study reviewed the key text classification techniques, including text models, feature selection methods and text classification algorithms, used in building a text classification system. In addition, a text classification system based on Mutual Information, the K-Nearest Neighbour algorithm and Support Vector Machines was implemented. The data set was created from the famous Reuters-21578 text classification collection. Furthermore, the experimental results showed that the classification accuracy rate was 91.1%, which was reported to be better than without feature selection and improved the classification rate. Moreover, the SVM classifier achieved higher performance than the KNN classifier [28].


In 2011, a new feature selection method (Auxiliary Feature) [9] was used. The enhancement of the performance of Naive Bayes for text classification was demonstrated: an auxiliary feature method was proposed that first determines features by an existing feature selection method, and then selects an auxiliary feature which can reclassify the text space with respect to the chosen features. In order to evaluate this experiment, the data set chosen consisted of 30000 junk mails and 10000 normal mails from CCERT. The results of this study show that the proposed method indeed improves the performance of the naive Bayes classifier.

Feature subset selection (FSS) is an important step for effective text classification (TC) systems, since it may have a great effect on the accuracy of the classifier [29, 30]. There are many valuable studies that investigated FSS metrics for English text classification tasks, and there are some works that handle the FSS problem for Arabic text classification tasks.

In recent years, there has been an increasing amount of literature, including an empirical comparison of seventeen FSS metrics for Arabic TC tasks using an SVM classifier; the evaluation used an Arabic corpus that consists of 7842 documents which are independently classified into ten categories. The results of that experiment proved that the Chi-square and Fallout FSS metrics work best for Arabic TC tasks [30].

An improved KNN algorithm for text classification was proposed, which builds the classification model by combining a constrained one-pass clustering algorithm with KNN text categorization. Although KNN is a simple and effective method for text classification, it has three drawbacks: firstly, the complexity of its sample similarity computation is huge; secondly, its performance is easily affected by single training samples; thirdly, KNN is considered a lazy learner because it does not build a classification model. To overcome these drawbacks, the improved KNN algorithm was implemented. This algorithm used the Vector Space Model (VSM) to represent documents. The results show that the INNTC classifier is much more effective and efficient than KNN [14].

GAMON and AUE [31] implemented a novel prototype-based classifier for text classification. The basic idea behind the algorithm is that each document category is modelled by a set of prototypes and their individual term subspaces of the document category. The classifier was tested using two English data sets, and its performance was compared with five other classifiers: SVM, three-prototype, KNN, KNN-model and centroid classifiers. The experimental results show that the proposed prototype-based classifier achieved higher classification accuracy at a lower computation cost than the traditional prototype-based classifiers, especially for data that includes interfering document classes.

2.1.2. Arabic Text Classification

The studies carried out for Arabic text classification are considered very few compared to other languages (like English), because the Arabic language has an extremely rich morphology and complex orthography. However, some related work has been proposed to classify Arabic documents:


Three classifiers, KNN, NB and a distance-based classifier, were implemented for Arabic text classification. Every category was represented as a vector of keywords in the distance-based and KNN classifiers, while the vectors used with NB were bags of words. The Dice measure was used to calculate the similarity between them. In addition, the accuracy of the classifiers was tested using an Arabic text corpus collected from online magazines and newspapers. According to the results, the NB classifier does better than the other two classifiers [32].

Another researcher applied the SVM algorithm to Arabic text classification. The paper pointed out that the SVM classifier achieved better results than other classifiers such as Naïve Bayes and KNN. In addition, light stemming for Arabic TC tasks was evaluated with the SVM classifier. As a result, light stemming did not enhance the performance of the Arabic SVM text classifier. On the other hand, Feature Subset Selection (FSS) was implemented and improved the performance of the Arabic SVM text classifier. Furthermore, the best results were achieved with two feature subset selection methods (Chi-square and NGL). Finally, a new Ant Colony based FSS algorithm (ACO) was applied and achieved the greatest TC effectiveness of the six FSS methods [12].

The main objective was to compare automatic text classification using the kNN, Rocchio and NB classifiers on the Arabic language [33]. Moreover, the system was tested using a corpus of 1445 Arabic text documents. Additionally, two models were used. The first model was the Support Vector Machine (SVM) model, which was used to implement the KNN and Rocchio classifiers; each document was represented as a vector of terms. The second model was probabilistic and was used to implement the NB classifier. In the probabilistic model the probability of a document belonging to each class is calculated, and the document is assigned to the class with maximum probability. The experiments showed that Naïve Bayes is the best performer, followed by kNN and Rocchio.

The paper reported a comparison between two probabilistic classifiers. The researchers showed that the multinomial model gave better results than the multivariate Bernoulli model at large vocabulary sizes; in contrast, when the vocabulary size is smaller, the multivariate Bernoulli model outperforms the multinomial model. Furthermore, the results were tested on five real-world corpora. The evaluation of their experiments proved that the multinomial model reduced error by an average of 27%, and sometimes by more than 50% [34].
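For reference, the multinomial event model discussed here can be sketched in a few lines of Python. This is an illustrative toy implementation with an assumed input format, not the code of the cited paper or of this thesis:

import math
from collections import Counter, defaultdict

def train(docs):
    # docs: list of (label, tokens) pairs -- an assumed toy format.
    class_counts = Counter(label for label, _ in docs)
    term_counts = defaultdict(Counter)
    for label, tokens in docs:
        term_counts[label].update(tokens)
    vocab = {t for counts in term_counts.values() for t in counts}
    return class_counts, term_counts, vocab

def classify(tokens, class_counts, term_counts, vocab):
    n = sum(class_counts.values())
    best, best_score = None, float('-inf')
    for c in class_counts:
        total = sum(term_counts[c].values())
        # log P(c) plus, for each token, log P(w | c) with Laplace smoothing.
        score = math.log(class_counts[c] / n)
        for w in tokens:
            score += math.log((term_counts[c][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = c, score
    return best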

Probabilistic generative models called parametric mixture models (PMMs) were also implemented. The main goal of PMMs was to address multiclass and multi-labelled text categorization problems. In addition, PMMs achieved good results compared to binary classification, because PMMs can simultaneously detect multiple categories of text instead of depending on binary judgments. Furthermore, the PMM approach was applied to World Wide Web pages and showed its efficiency [27].

A comparative study of generative models for document clustering using the multinomial model was demonstrated. The comparison of this model with two other probabilistic models, the multivariate Bernoulli and von Mises-Fisher (vMF) [2003] models, was performed by applying clustering. Unfortunately, the Bernoulli model was the worst for text clustering. On the other hand, the vMF model produced better clustering results than both the Bernoulli and multinomial models.

In the literature, a novel mixture model method for text clustering was named the multinomial mixture model with feature selection (M3FS). The M3FS method used the MMM instead of Gaussian mixtures to improve text clustering tasks. Prior studies have noted that, with no labels available, feature selection in unsupervised text clustering is a hard problem; in order to overcome this problem, M3FS was proposed for text clustering. Furthermore, the results demonstrate that the M3FS method has good clustering performance and feature selection capability [34].

The main idea addressed two problems. One is the many irrelevant features which may affect the speed and also compromise the accuracy of the learning algorithm used; the second challenge is the presence of outliers, which affect the resulting model's parameters. For this reason, the researchers suggested applying an algorithm that partitions a given data set without a priori information about the number of clusters, together with a novel statistical mixture model, based on the Gamma distribution, which makes explicit what data or features have to be ignored and what information has to be retained. The performance of this finite mixture model method was demonstrated using different applications involving data analysis, real data and object shape clustering. Moreover, the experiments proved that this approach has excellent modelling capabilities and that feature selection mixed with outlier detection significantly influences clustering performance [24].


The history of naive Bayes in information retrieval was also discussed, together with a theoretical comparison of the multinomial model and the multivariate Bernoulli model (also called the binary independence model) [31].

Compared to Indo-European languages (like English), the Arabic language has an extremely rich morphology and a complex orthography. This is one of the main reasons [17, 34, 35] behind the lack of research in the field of Arabic text classification. However, many machine learning approaches have been proposed to classify Arabic documents: the Support Vector Machine (SVM) classifier with the Chi-square feature extraction method [35], the Naïve Bayesian method, k-Nearest Neighbours and distance-based classifiers, and the Rocchio algorithm [31].

Sawaf, Zaplo and Ney [15] used the maximum entropy method for Arabic document clustering. Initially, documents were randomly assigned to clusters. In subsequent iterations, documents were shifted from one cluster to another if an improvement was gained. The algorithm terminated when no further improvement could be achieved. Their text classification method is based on unsupervised learning.

Duwairi [17] proposed a distance-based classifier for Arabic text classification tasks, where the Dice measure was used as a similarity measure. In the work done by Duwairi, each category was represented as a vector of words. In the training phase, the text classifier scanned training documents to extract features that best capture inherent category-specific properties. Documents were classified on the basis of their closeness to the feature vectors of the categories.

El-Halees [34] implemented a maximum entropy based classifier to classify Arabic documents. Compared with other text classification systems (such as those of El-Kourdi et al. and Sawaf et al.), the overall performance of the system was good (in these comparisons, the results were used as recorded in the published papers mentioned above by El-Halees).

Hmeidi, Hawashin and El-Qawasmeh [13] reported a comparative study of SVM and K-Nearest Neighbours (KNN) classifiers on Arabic text classification tasks. They concluded that the SVM classifier shows a better micro-averaged F1-measure.

Al-Saleem [28] proposed automated Arabic text classification using SVM and NB classification methods. These methods were investigated on different Arabic datasets, and several text evaluation measures were used. The experimental results on different Arabic text classification datasets showed that the SVM algorithm outperforms NB with regard to all measures (recall, precision and F-measure). The F-measure of SVM was 77.8%, while it was 74% for NB.

Al-Diabat et al. [12] investigated the problem of Arabic Text Classification (ATC) using rule-based classification approaches. The performance of different classification approaches that produce simple "IF-Then" knowledge was evaluated to find the most appropriate one to handle the ATC problem. Four rule-based classification algorithms were investigated: One Rule, rule induction (RIPPER), decision tree (C4.5), and hybrid (PART). An Arabic data collection with 1526 text documents belonging to 6 categories was used. The results showed that the hybrid approach of PART outperforms the rest of the algorithms; the average precision was 61.9% and the average recall 62.3%.

Wahbeh et al. [19] compared three text classification techniques: SVM, NB, and C4.5. A set of Arabic documents collected from different websites and divided into four categories was used, and the WEKA toolkit was used to run the classifiers; word representation was used to represent the documents. Another project proposed an approach for ATC using Association Rule Mining; this approach facilitated the discovery of association rules for building a classification model for ATC. Three classification methods that use association rules were applied: ordered decision list, weighted rules, and majority voting. The experimental results showed that the majority voting method gave better results than the other methods.

A novel batch-mode active learning approach using SVM for Arabic text classification has been presented, as there are not many studies done in this area for the Arabic language. The purpose of applying active learning is to reduce the amount of data needed for the training phase. Thus the cost of manually annotating the data will be lower; also, the learning process can be sped up, since the active method is allowed to choose the data from which it learns [17].

Since feature selection is a key factor in the accuracy and effectiveness of the resulting classification, the author proposed Binary Particle Swarm Optimisation (BPSO) for feature selection in Arabic text classification. The aim of applying BPSO/KNN is to find a good subset of features to facilitate the task of Arabic text classification. SVM, Naïve Bayes and the C4.5 decision tree were applied as classification algorithms. The suggested method was effective and achieved satisfactory classification accuracy [25].

It was reported that multiword features were implemented to improve Arabic text classification. Multiword features are represented as a mixture of words appearing within windows of varying size. Multiword features were applied with two similarity functions, the Dice similarity function and the cosine similarity function, to improve the outcome of Arabic text classification. According to the results achieved, the Dice function performs better than the cosine function. With the Dice similarity function, the frequencies of the features in the document are ignored and only their existence is recognized [11].

The investigators concentrated on just single-label assignment. The goal of this paper was to present and compare results obtained on a Saudi newspaper Arabic text collection using the SVM algorithm and the NB algorithm. The experiments show that the SVM classifier achieved better results than the NB classifier [28].

Latent Dirichlet Allocation (LDA) was proposed as a text feature. LDA was used to index and represent Arabic texts. The main idea behind LDA is that documents are represented as random mixtures over latent topics, where each topic is described by a distribution over words. SVM was used for the classification task. Moreover, the LDA-SVM algorithm achieved high effectiveness for Arabic text classification, exceeding SVM without LDA, Naïve Bayes and KNN classifiers [20].


2.2. Summary

In chapter 2, different text classification algorithms were described. Some works addressed text classification in general, and others addressed the Arabic language specifically. In addition, some papers using the Multinomial Mixture Model were presented.


3. Chapter Three: Methodology


3.1. Introduction

There are many approaches can be used in text classification. KNN, Rocchio

and Naïve Bayes by using MMM model have been implemented. Moreover,

these algorithms have been applied to the same datasets.

The main aim of applying TC on the Arabic languages is to improve the

performance of information retrieved without TC. Many steps can be done to

implement the TC task. However, all phases have been explained in section 3.2.

An IR process starts with the submission of a query, which describes a user’s

topic and finishes with a set of ranked results estimated by the IR’s ranking

scheme to be the most relevant to the query [33].

Recall and Precision consider famous measures; these can be used to evaluate

any IR system. Furthermore, the efficiency of the system can be determined by

using those measures.

This chapter is divided into three main sections. The section 3.1 shows overview

about the project. Section 3.2 has been presented the main text classification

system architecture.Section3.3mentioned to the short summary of chapter three.

3.2. System Architecture

The text classification technique has been implemented by passing through several phases, which execute sequentially to facilitate the TC task. Uncategorized documents were pre-processed by removing punctuation marks and stopwords. Every document is then represented either as a vector of words only, or as a vector of words, their frequencies and the number of documents in which these words appear (inverse document frequency). Stemming was used to decrease the dimensionality of the feature vectors of documents. The accuracy of the classifier is computed using recall, precision and F-measure.


Figure 3-1 below indicates the main steps in the classification system.

Figure 3-1 Text Classification Architecture

3.2.1. Arabic Corpus

The accuracy of the classifiers was tested using a corpus containing 1445 documents. The documents are divided into nine categories: Computer, Economics, Education, Sport, Politics, Engineer, Medicine, Law, and Religion. Moreover, some of the documents are used for training the classifier and other sets for testing it. Testing sets are the input documents that need to be classified; the training set is a set of documents or topics tagged with the correct classes. The corpus and categories are shown and explained in more detail in chapter 4. In addition, figure 3-2 shows the pre-processing steps:


Figure 3-2 Pre-processing Steps

3.2.2. Pre-processing

Pre-processing can be defined as the process of filtering out words which may not give any meaning to a text and might not be useful in information retrieval systems; these words are called stop words [30]. The purpose of applying pre-processing is to transform documents into a suitable representation for the classification task. In addition, it reduces the size of the information, which may make the search operation faster. Pre-processing is done as follows:

- Documents in formats such as HTML, SGML and XML are converted to plain text format.

- Tokenization: tokenization divides the document into a set of tokens (words).

- Digits and punctuation marks are removed from each document.

- Normalization: a very essential phase that reduces the many forms of words that have the same meaning but are written differently. The Arabic language suffers from a very common problem where a single word may be written in many forms, like إبدا - ابد - أبد (which mean start in English).


- Stopword removal: there are two kinds of terms in any document. The first kind is called stopwords, which occur commonly in all documents and may not give any meaning to the document; the second kind can be described as keywords or features. Stopwords (such as punctuation marks, formatting tags, prepositions, pronouns, conjunctions and auxiliary verbs) have been removed to reduce the text size and save processing time. Moreover, removing these high-frequency words is essential because they may cause documents to be misclassified.

- Stemming: stemming is another common pre-processing step. The stemming phase in Arabic is more complex than in English (for instance, an Arabic gendered word has forms for singular, dual and plural). The purpose of decreasing the size of the initial feature set is to remove misspellings or words with the same stem; this is necessary to enhance the performance of an information retrieval system. Moreover, there are several different approaches to perform stemming: the root-based stemmer, the light stemmer, and the statistical stemmer. The light stemmer has been used in this thesis:

Step 1: Normalization

Remove diacritics (primarily weak vowels) except shadda.

Replace آ,إ, and أ with ا.

Replace final ى with ي.

Replace final ة with ه.

Step 2: Waw Removal

Remove initial "and" و if the remainder of the word is 3 or more characters long.

Step 3: Prefix Removal


Remove any of the definite articles (prefixes) if this leaves 3 or more characters.

Step 4: Suffix Removal

The suffixes are listed in Table 3-1; remove any suffix found at the end of the word if this leaves 3 or more characters.

Table 3-1 Strings removed by light stemming

Prefixes: ال, وال, بال, كال, فال
Suffixes: ها, ان, ات, ون, ين, يه, ه, ي
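The light stemming steps above can be sketched in Python as follows (an illustrative sketch following Steps 1-4 and the strings of Table 3-1, not the exact code used in this thesis):

import re

PREFIXES = ['ال', 'وال', 'بال', 'كال', 'فال']
SUFFIXES = ['ها', 'ان', 'ات', 'ون', 'ين', 'يه', 'ه', 'ي']

def light_stem(word):
    # Step 1: normalization -- strip short-vowel diacritics (keep shadda), unify letters.
    word = re.sub('[\u064B-\u0650\u0652]', '', word)
    word = re.sub('[أإآ]', 'ا', word)
    word = re.sub('ى$', 'ي', word)
    word = re.sub('ة$', 'ه', word)
    # Step 2: remove an initial waw ("and") if 3 or more characters remain.
    if word.startswith('و') and len(word) >= 4:
        word = word[1:]
    # Step 3: remove one definite-article prefix if 3 or more characters remain.
    for p in sorted(PREFIXES, key=len, reverse=True):
        if word.startswith(p) and len(word) - len(p) >= 3:
            word = word[len(p):]
            break
    # Step 4: remove one suffix from Table 3-1 if 3 or more characters remain.
    for s in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(s) and len(word) - len(s) >= 3:
            word = word[:-len(s)]
            break
    return word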

Indexing:

Document indexing is one of the most substantial issues in TC, which includes document representation and a term weighting scheme. The bag-of-words is the most common way to represent the content of a text. This approach is considered simple because it records only the frequency of each word in a document. Moreover, for all the predefined categories, the synonyms and prefix words for each category are found, which helps to assign a document to a category based on the synonym or prefix of a term.

Several measures have been applied to calculate the term weights:

Term Frequency (TF): the simplest measure for weighting each term in a text. The drawback of TF is that it considers only term occurrence within a text; according to the achieved results, this improves recall but does not improve precision.

Inverse Document Frequency (IDF): the other popular weighting measure is IDF. The main idea of IDF is to concentrate on the terms which occur rarely in a collection of texts. This improves precision without enhancing recall.

TF.IDF: since term weighting affects the evaluation of text classification, TF.IDF combines the two weighting measures TF and IDF to enhance both recall and precision, and thereby the text classification result. On the other hand, with TF.IDF, when a new document arrives, the weighting factors of all documents must be recalculated, since they depend on the number of documents.

TF-IDF has been used in this work as one of the most popular weighting schemes. It considers not only the term frequencies in a document, but also the frequencies of a term in the entire collection of documents. Classic TF-IDF assigns to term t a weight in document d as shown in equations 3.1 and 3.2 [33]:

TFIDF(i, j) = TF(i, j). IDF(i) 3.1

Thus, TF·IDF weighting assigns a high degree of importance to terms occurring frequently in only a few documents of a collection. The Inverse Document Frequency (IDF) for term Ti is calculated as follows:

IDF(i) = log(N / DF(i))    3.2

Where DF(i), the document frequency of term Ti, is the number of documents in which Ti occurs, and N is the total number of documents.
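The short Python sketch below illustrates equations 3.1 and 3.2 on a toy collection. It is illustrative only; the documents and variable names are my own and not taken from the thesis corpus.

import math
from collections import Counter

docs = [["قانون", "محكمة", "قاضي"],
        ["كرة", "فريق", "مباراة"],
        ["قانون", "عقد", "محكمة"]]

N = len(docs)
df = Counter()                 # DF(i): number of documents containing term i
for d in docs:
    df.update(set(d))

def tfidf(doc):
    tf = Counter(doc)          # TF(i, j): raw frequency of term i in document j
    return {t: tf[t] * math.log(N / df[t]) for t in tf}   # eq. 3.1 with eq. 3.2

print(tfidf(docs[0]))  # the rare term قاضي outweighs قانون and محكمة, which are shared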

Feature Selection:

Feature sub-set selection (FSS) is one of the important pre-processing steps in machine learning and an essential task for text classification. Feature selection methods study how to choose a subset of attributes that are used to construct models describing data [15]. Many FSS methods have been applied to Arabic text [12, 26].

According to previous related work, the FSS approach has proven to provide several advantages for text classification systems: it is very effective in reducing dimensionality, removing irrelevant and redundant terms from documents, and decreasing computational complexity. In addition, FSS increases learning accuracy and improves classification efficiency and scalability by making the classifier simpler and faster to build.

Many FSS algorithms have been tested and compared in text classification systems. For example, Chi-square and fallout achieved satisfactory results on Arabic TC tasks, and Ant Colony Optimization (ACO), an optimization algorithm derived from the study of real ant colonies, is one of the promising approaches to better feature selection.
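As an illustration of one of the FSS methods mentioned above, the following is a minimal sketch of chi-square term scoring. The formulation is the standard one, not taken from the thesis, and the function, parameter names, and counts are hypothetical.

def chi_square(A, B, C, D):
    # A: in-class docs containing the term,  B: out-of-class docs containing it
    # C: in-class docs without the term,     D: out-of-class docs without it
    N = A + B + C + D
    num = N * (A * D - C * B) ** 2
    den = (A + C) * (B + D) * (A + B) * (C + D)
    return num / den if den else 0.0

scores = {"قانون": chi_square(40, 5, 10, 145),   # concentrated in one class
          "في": chi_square(48, 140, 2, 10)}      # spread across all classes
print(sorted(scores, key=scores.get, reverse=True))  # the class-specific term ranks first

The terms with the k highest scores are kept as the feature subset.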

To classify a new document, it is pre-processed by removing punctuation marks and stopwords, followed by extracting the roots of the remaining keywords. The feature vector of the new document is then compared with the feature vectors of all categories. Ultimately, the document is assigned to the category with the maximum similarity.

3.2.3. Classifiers

Many types of classifiers have been applied and executed in the text classification area. The results differ considerably from one to another, since every classifier has its own specific algorithm. In what follows, several kinds of classifiers are explained, with their advantages and drawbacks according to the results achieved.

Page 44: الرسالة 15 1 2014

34

3.2.3.1. Support Vector Machine (SVM)

The support vector machine has been widely applied in the text classification area [12, 20, 24]. The SVM classifier is one of the supervised machine learning techniques. The document is represented as a vector of terms (words); each dimension corresponds to a separate term. When a term occurs in the document, the corresponding value of the vector is non-null and can be calculated using a weighting method such as tf·idf. In linear classification, SVM creates a hyperplane that divides the data into two sets with the maximum margin; the maximum-margin hyperplane is the one whose distance to the closest points on the two sides is equal. To apply SVM, the learned function below can be used, as shown in equation 3.3 [24]:

f(x) = sign(wx + b)    3.3

Where w is a weight vector in R^n. SVM finds the hyperplane y = wx + b separating the space R^n into two half-spaces with the maximum margin [24].

The SVM classifier is one of the simple and effective algorithms for the classification task, and it has the potential to handle a huge number of features. On the other hand, with the SVM classifier, documents containing similar contexts but different term vocabularies are not classified into the same category. Also, in the vector representation, the order in which terms appear in the document is lost.
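A minimal sketch of the decision rule in equation 3.3 is shown below. The weight vector and bias are illustrative values, not a trained model, and the function name is my own.

def svm_predict(w, x, b):
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1    # which side of the hyperplane x lies on

w = [0.8, -0.4, 0.1]    # in practice, learned by maximizing the margin during training
b = -0.2
print(svm_predict(w, [1.0, 0.5, 0.0], b))   # 0.8 - 0.2 - 0.2 = 0.4 -> 1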

The goal of this project is to compare three different classification techniques on the Arabic language, namely kNN, Rocchio, and Naïve Bayes using a multinomial model.

Page 45: الرسالة 15 1 2014

35

3.2.3.2. K-Nearest Neighbour (KNN)

The K-nearest neighbour (KNN) is one of the well-known text classification techniques. The principle of the KNN technique is that documents that are close to each other in the vector space belong to the same class; the essential idea is to identify the class of a document based on a similarity measure.

KNN has several advantages: it is simple, non-parametric, and shows very good performance on text categorization tasks for the Arabic language. On the other hand, it has drawbacks: it is difficult to find the optimal value of k, and classification time is long because the distance from each query instance to all training samples has to be computed. In addition, this classifier is called a lazy learning system, because it does not involve a true training phase [13].

The major steps to apply the K-nearest neighbour classifier are:

Pre-process the documents in the training set.

Choose the parameter K, the number of nearest neighbours of d in the training data.

Determine the distance between the testing document (d) and the training documents (previous classes).

Sort the distances and determine the neighbours based on the minimum of the k distances.

To classify an unknown document, the KNN classifier ranks the document's neighbours among the training documents and uses the class labels of the k most similar neighbours. The similarity score of each nearest-neighbour document to the test document is used as the weight of that neighbour's classes. If a specific category is shared by more than one of the k nearest neighbours, then the sum of the similarity scores of those neighbours gives the weight of that shared category [17].

Page 46: الرسالة 15 1 2014

36

An example of KNN classification is shown in figure 3-2.a. The document X is assumed to be the test sample, which should be classified either to the first category of white circles or to the second category of black circles. If k = 1, document X will be classified to the white category, because there is one white circle and no black circle inside the inner circle. If k = 5, it is classified to the black category, because the number of black circles is greater than the number of white circles. Majority voting is used to determine the category of an unclassified document. On the other hand, if k = 10, the document would be classified to both categories (black and white). To avoid this problem, the similarity is determined according to the total weight of the two categories, as shown in figure 3-2.b.

Figure 3-2 An example for KNN

If k = 5, the document is classified to the white category, because the sum of the weights of the white category (9) is greater than the black category's weight (8).

3.2.3.3. Rocchio:

The Rocchio relevance feedback algorithm is one of the most popular and widely applied learning methods from information retrieval. In addition, Rocchio is considered easy to implement and very fast compared to KNN [33]. The basic idea behind the Rocchio approach is to use a vector to represent each document and each class. The vector representing a class (c_j) is called its prototype or centroid [5, 33].

The prototype of each class is calculated by subtracting the average of all documents that do not appear in class C_j from the average of all documents that appear in class C_j:

c_j = α · (1/|C_j|) · Σ_{d∈C_j} d − β · (1/|D−C_j|) · Σ_{d∈D−C_j} d    3.4

Where α and β are parameters that adjust the relative impact of the positive and negative training examples.

Practically, in text classification Rocchio calculates the similarity between the test document and each of the prototype vectors; the test document is then assigned to the category with the maximum similarity score.
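Equation 3.4 can be sketched as follows. This is an assumed illustration; the thesis does not fix α and β, so the common defaults α = 16 and β = 4 from the relevance-feedback literature are used here, and the function names are my own.

from collections import defaultdict

def centroid(vectors):
    acc = defaultdict(float)
    for v in vectors:
        for t, w in v.items():
            acc[t] += w
    n = len(vectors)
    return {t: w / n for t, w in acc.items()} if n else {}

def prototype(in_class_docs, out_class_docs, alpha=16.0, beta=4.0):
    pos, neg = centroid(in_class_docs), centroid(out_class_docs)
    # average of in-class documents minus average of out-of-class documents (eq. 3.4)
    return {t: alpha * pos.get(t, 0.0) - beta * neg.get(t, 0.0)
            for t in set(pos) | set(neg)}

At classification time, the test vector is compared (for example by cosine similarity) against each class prototype and assigned to the most similar one.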

3.2.3.4. Naive Bayes:

The Naive Bayes classifier uses a probabilistic model of text. It achieves good performance results on the TC task for Arabic text [18].

NB is a simple probabilistic classifier based on applying Bayes' theorem; the conditional probability P(c_j|d_i) for each class can be computed from equations 3.5 and 3.6 [8, 9]:

p(c_j|d_i) = P(c_j) · P(d_i|c_j) / P(d_i)    3.5

Page 48: الرسالة 15 1 2014

38

Where P(c_j) is the prior probability of a document occurring in class c_j. Frequently, each document d_i in text classification is represented as a vector of words (v_1, v_2, ..., v_t); the above equation then becomes:

p(c_j|d_i) = P(c_j) · Π_{k=1..t} P(v_k|c_j) / P(d_i)    3.6

Naïve Bayes is frequently used in text classification due to its speed and simplicity. There are two event models of Naïve Bayes: the multinomial model and the Bernoulli model [34]. In the Bernoulli model, a test document is classified using binary occurrence information, and the number of occurrences is ignored, whereas the multinomial model keeps track of multiple occurrences [35].

3.2.3.5. Multinomial Mixture Model

It is necessary to clarify exactly what is meant by MMM. It models the distribution of words in a document as a multinomial: a document is treated as a sequence of words, and it is assumed that each word position is generated independently of every other [30]. In text classification, the use of class-conditional multinomial mixtures can be seen as a generalization of the Naive Bayes text classifier, relaxing its (class-conditional feature) independence assumption [29]. When a test document is classified, an MMM keeps track of multiple occurrences, in contrast to a model such as the Bernoulli model [31], which uses binary occurrence information and ignores the number of occurrences. Since an MMM keeps the occurrence information of all words (frequency, position), the classification task is made easier, according to equations 3.7 to 3.11 [35].

p(c_j|d_i) = P(c_j) · P(d_i|c_j) / P(d_i)    3.7

P(c_j) = N_c / N    3.8

Where N_c is the number of documents in class c_j and N is the number of documents in the collection.

p(c_j|d_i) = P(c_j) · Π_{k=1..t} P(w_k|c_j) / P(d_i)    3.9

P(w|c_j) = (count(w, c_j) + 1) / (count(c_j) + |V|)    3.10

Where count(w, c_j) is the frequency of word w in class c_j, count(c_j) is the total number of words in class c_j, and |V| is the total number of words in the collection.

The Bayes classifier computes separately the posterior probability of document d_i falling into each class and assigns the document to the class with the highest probability, that is:

c_optimal = arg max_j p(c_j|d_i),  1 ≤ j ≤ |C|    3.11

Where |C| is the total number of classes.
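Equations 3.8 to 3.11 can be sketched in Python as follows: a multinomial classifier with add-one (Laplace) smoothing, computed in log space to avoid floating-point underflow. This is an assumed illustration; the function names and toy data are my own.

import math
from collections import Counter, defaultdict

def train(labelled_docs):
    # labelled_docs: list of (list_of_words, class_label) pairs
    n_docs = Counter()                    # N_c: documents per class
    word_counts = defaultdict(Counter)    # count(w, c_j)
    vocab = set()
    for words, c in labelled_docs:
        n_docs[c] += 1
        word_counts[c].update(words)
        vocab.update(words)
    return n_docs, word_counts, vocab

def classify(words, n_docs, word_counts, vocab):
    N, V = sum(n_docs.values()), len(vocab)
    best, best_lp = None, -math.inf
    for c in n_docs:
        total_c = sum(word_counts[c].values())        # count(c_j)
        lp = math.log(n_docs[c] / N)                  # log P(c_j), eq. 3.8
        for w in words:                               # eqs. 3.9 and 3.10
            lp += math.log((word_counts[c][w] + 1) / (total_c + V))
        if lp > best_lp:                              # arg max, eq. 3.11
            best, best_lp = c, lp
    return best

train_set = [(["هدف", "فريق", "مباراة"], "Sport"), (["قانون", "محكمة"], "Law")]
print(classify(["فريق", "هدف"], *train(train_set)))   # -> Sport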


3.2.4. Evaluation

There are many retrieval systems on the market, but which one is the best? It depends on the results each one produces. An important issue for information retrieval systems is the notion of relevance: the purpose of an information retrieval system is to retrieve all the relevant documents (recall) and no non-relevant documents (precision). Recall and precision are defined as:

Precision: the ability to retrieve top-ranked documents that are mostly relevant.

Precision = (number of relevant documents retrieved) / (total number of documents retrieved)    3.12

The maximum (and optimal) precision value is 100%; the worst possible precision, 0%, is obtained when not a single relevant document is retrieved.

Recall: the ability of the search to find all of the relevant items in the corpus.

Recall = (number of relevant documents retrieved) / (total number of relevant documents)    3.13

The perfect information retrieval system is achieved when both recall and precision equal one.

F1-measure: a measure of effectiveness that combines the contributions of precision and recall. The well-known F1 measure is used to test the performance of information retrieval systems [33]; it is defined as:

F1 = 2 · Pr · Re / (Pr + Re)    3.14


Fallout: another measure that can be used to evaluate information retrieval systems. Although recall and precision are considered good evaluation measures, they do not take into account the number of irrelevant documents in the collection; recall is undefined when there is no relevant document in the collection, and precision is undefined when no document is retrieved. Fallout, by contrast, takes the number of irrelevant documents in the collection into account: it is the fraction of irrelevant documents that are retrieved. A good system should therefore have high recall and low fallout.
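The following small sketch computes equations 3.12 to 3.14 plus fallout from raw counts. The numbers are illustrative, not results from the thesis.

relevant_retrieved = 40   # relevant documents that were retrieved
retrieved = 50            # all retrieved documents
relevant = 60             # all relevant documents in the collection
irrelevant = 940          # all irrelevant documents in the collection

precision = relevant_retrieved / retrieved                # eq. 3.12
recall = relevant_retrieved / relevant                    # eq. 3.13
f1 = 2 * precision * recall / (precision + recall)        # eq. 3.14
fallout = (retrieved - relevant_retrieved) / irrelevant   # irrelevant retrieved / all irrelevant
print(precision, recall, round(f1, 3), round(fallout, 4))  # -> 0.8, ~0.667, 0.727, 0.0106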

3.3. Summary

This chapter gave an introduction to information retrieval and described the common tasks of a TC system. Using a multinomial mixture model as the machine learning algorithm is nowadays a popular approach. In the rest of the chapter, three interesting kinds of TC algorithms were described briefly.


4. Chapter Four: Experiments and Evaluation


4.1. Introduction

Automatic text classification is defined as classifying unlabelled documents into predefined categories based on their contents. It has become an important topic due to the increased number of documents on the internet that people have to deal with daily, which has led to an urgent need to organize them. In this chapter, experiments are carried out and the performance of the Rocchio algorithm, traditional k-NN, and Naïve Bayes using MMM classifiers is documented.

These classifiers are evaluated by several measures in order to determine whether Naïve Bayes using MMM outperforms the other classifiers. The rest of this chapter is organized as follows: section 4.2 discusses the preparation of the data set, section 4.3 lists the performance measures, section 4.4 discusses the evaluation results, section 4.5 discusses the results of MMM with 5070 documents, and section 4.6 gives the summary.

4.2. Data Set Preparation

The corpus was downloaded from [34]. The documents are classified into nine categories; the categories and the number of documents in each appear in table 4-1. The total number of documents is 1445, and the documents vary in length. The nine categories are: Computer, Economics, Education, Sport, Politics, Engineer, Medicine, Law, and Religion. After pre-processing all the documents, a copy of the pre-processed documents was converted into the Attribute-Relation File Format (ARFF) in order to be suitable for the Weka tool.


Table 4-1 Number of Documents for each Category

NO  Category    Number
1   Medicine    232
2   Economics   222
3   Religion    222
4   Sport       232
5   Politics    481
6   Engineer    441
7   Law         72
8   Computer    22
9   Education   88

4.3. Performance Measures:

The performance of a text classification algorithm means its computational efficiency and classification effectiveness. When a large number of documents is categorized into many categories, the efficiency of text classification must be taken into account. The effectiveness of text classification is measured by precision and recall, defined (using the counts of table 4-2) as follows:

Recall = tp / (tp + fp),  tp + fp > 0    4.1

Precision = tp / (tp + fn),  tp + fn > 0    4.2


Where tp counts the documents assigned to a category correctly, fn counts the documents assigned to it incorrectly, fp counts the documents that belong to the category but were not assigned to it, and tn counts the documents that were neither assigned to it nor belong to it, as shown in table 4-2.

Table 4-2 Confusion Matrix for Performance Measures

Classifier Decision     Expert: YES (correct)    Expert: NO (incorrect)
Assigned (YES)          tp                       fn
Not Assigned (NO)       fp                       tn

Recall is the fraction of relevant instances that are retrieved, as in equation 4.1, while precision is the fraction of retrieved instances that are relevant, as in equation 4.2. Both precision and recall are therefore based on an understanding and measure of relevance. Precision and recall values often depend on parameter tuning; there is a trade-off between precision and recall. This is why another measure that combines both precision and recall is used: the F-measure, which is defined as follows:

F-measure = 2 · (Precision × Recall) / (Precision + Recall)    4.3

For obtaining estimates of precision and recall relative to the whole category

set, two different methods may be adopted:


Table 4-3 The Global Contingency Table

Category set C = {c_1, ..., c_|C|}    Expert: YES               Expert: NO
Classifier: YES                       TP = Σ_{i=1..|C|} TP_i    FN = Σ_{i=1..|C|} FN_i
Classifier: NO                        FP = Σ_{i=1..|C|} FP_i    TN = Σ_{i=1..|C|} TN_i

Macroaveraging: precision and recall are first evaluated locally for each

category, and then globally by averaging over the results of the different

categories.

Table 4-4 Macro-Average

Precision:  Pr = (1/|C|) · Σ_{i=1..|C|} TP_i / (TP_i + FN_i)
Recall:     Re = (1/|C|) · Σ_{i=1..|C|} TP_i / (TP_i + FP_i)

Microaveraging: precision and recall are obtained by globally summing over all individual decisions. For this, the global contingency table of table 4-3, obtained by summing over all category-specific contingency tables, is needed.


Table 4-5 Micro-Average

Precision:  Pr = Σ_{i=1..|C|} TP_i / Σ_{i=1..|C|} (TP_i + FN_i)
Recall:     Re = Σ_{i=1..|C|} TP_i / Σ_{i=1..|C|} (TP_i + FP_i)

Macro- and micro-averaging formulas for precision and recall are shown in tables 4-4 and 4-5.

There are differences between micro-averaged and macro-averaged results, and the dissimilarity between the two can be large. Micro-averaged results give equal weight to the documents and thus emphasize larger topics, while macro-averaged results give equal weight to the topics and thus emphasize smaller topics more than micro-averaged results do. As a result, the ability of a classifier to behave well on categories with low generality (categories with few positive training instances) is emphasized by macro-averaging and much less so by micro-averaging. Micro-averaged results are therefore really a measure of performance on the large classes in a test collection [32]. To get a sense of performance on small classes, macro-averaged results should be computed. Whether one or the other should be used obviously depends on the application requirements.

In single-label classification, micro-averaged precision equals recall, and is

equal to F1, so only micro F1 will be noted for the micro-averaged results.
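The formulas of tables 4-4 and 4-5 can be sketched as follows, using the thesis's convention that FN counts documents assigned but incorrect and FP counts documents relevant but not assigned. The per-category counts are illustrative, not taken from the experiments.

cats = [
    {"tp": 231, "fn": 0,  "fp": 1},   # Sport (illustrative counts)
    {"tp": 56,  "fn": 13, "fp": 12},  # Education
    {"tp": 67,  "fn": 0,  "fp": 3},   # Computer
]
macro_p = sum(c["tp"] / (c["tp"] + c["fn"]) for c in cats) / len(cats)
macro_r = sum(c["tp"] / (c["tp"] + c["fp"]) for c in cats) / len(cats)
micro_p = sum(c["tp"] for c in cats) / sum(c["tp"] + c["fn"] for c in cats)
micro_r = sum(c["tp"] for c in cats) / sum(c["tp"] + c["fp"] for c in cats)
print(round(macro_p, 3), round(macro_r, 3), round(micro_p, 3), round(micro_r, 3))

Note how the small Education category pulls the macro averages down more than the micro averages, as discussed above.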


4.4. Evaluation Results

The results obtained for each of the k-nearest neighbour, Rocchio, and Naïve Bayes using MMM classifiers are as follows:

4.4.1. Naive Bayes Algorithm Using (MMM)

Table 4-6 shows the confusion matrix for the Naïve Bayes using MMM algorithm. The number reported in an entry of the confusion matrix corresponds to the number of documents known to actually belong to the category given by the row header of the matrix but assigned by NB using MMM to the category given by the column header.

As shown in table 4-6, 67 documents of the Computer category are classified correctly into the Computer category, while 3 documents of Computer are classified incorrectly: 2 of these 3 documents are classified as Education and 1 as Law. The best classification is for the Sport category, where 231 documents are classified correctly. The lowest number of correctly classified documents is for the Education category, where 56 documents are classified correctly and 12 documents are classified incorrectly.


Table 4-6 Confusion Matrix Results for NB Using MMM Algorithm

Figure 4-1 shows recall, precision, and f-measure for every category when the Naïve Bayes classifier was used. Precision reaches its highest value (1) for the Sport and Computer categories, while the lowest precision value (0.812) is for the Education category. Recall reaches its highest value (0.996) for the Sport category and its lowest value (0.804) for the Law category. The F-measure reaches its highest value (0.998) for the Sport category and its lowest value (0.818) for the Education category. The rest of the figure is self-explanatory.


Table 4-7 Precision, Recall, and F-measure per Category for NB Using MMM

Category    Precision  Recall  f-measure
Computer    1          0.957   0.978
Economy     0.864      0.841   0.852
Education   0.812      0.824   0.818
Engineer    0.948      0.948   0.948
Law         0.839      0.804   0.821
Medicine    0.996      0.991   0.993
Politics    0.833      0.918   0.873
Religion    0.905      0.885   0.895
Sport       1          0.996   0.998

Figure 4-1 plots the precision, recall, and f-measure for all the categories classified using Naïve Bayes with MMM.


Figure 4-1 Result of the Naive Bayes – MMM Classification Algorithm

Table 4-8 shows the weighted average of the above values over all categories for the MMM algorithm; the overall f-measure is 0.908, which is considered high.

Table 4-8 NB Using MMM Classifier Weighted Average for the Nine Categories

Naïve Bayes using MMM    Precision  Recall  F-measure
Weighted average         0.911      0.907   0.908

4.4.2. Comparison of MMM with Other Techniques and Discussion of Results

First, a comparison was made between the k-NN, Rocchio, and Naïve Bayes classifiers. All the results for KNN and Rocchio were taken from [33]. A summary of the recall, precision, and F1 measures is shown in table 4-9. Naïve Bayes gave the best F-measures, with MiF1 = 0.9185 and MaF1 = 0.908, followed by kNN widf with MiF1 = 0.7970 and MaF1 = 0.7871, closely followed by Rocchio tfidf with MiF1 = 0.7314 and MaF1 = 0.7882. A comparison of the MiF1 and MaF1 values is shown in figure 4-2.

Table 4-9 Classifier Comparison

Method         MaP     MaR     MaF1    MiF1
kNN tf         0.7100  0.5359  0.6100  0.5711
kNN tfidf      0.8363  0.6902  0.7562  0.7272
kNN widf       0.8094  0.7662  0.7871  0.7970
Rocchio tf     0.5727  0.4501  0.5022  0.4427
Rocchio tfidf  0.8515  0.7337  0.7882  0.7314
Rocchio widf   0.7796  0.7199  0.7484  0.6968
Naïve Bayes    0.911   0.907   0.908   0.9185

Figure 4-2 shows the MaF1 and MiF1 values for all the classifiers (KNN, Rocchio, and Naïve Bayes); from the figure we can see that Naïve Bayes using MMM obtained the highest value on both measures (MaF1 and MiF1).

Figure 4-2 MaF1, MiF1 Comparison for Classifiers



Figure 4-3 shows the macro precision for all the classifiers; the highest value is for Naive Bayes using MMM, followed by Rocchio, with KNN tfidf not far behind Rocchio.

Figure 4-3 MaP Comparison for Classifiers

Figure 4-4 shows the macro recall for all the classifiers; the highest value is for Naive Bayes using MMM, followed by KNN, with Rocchio not far behind KNN.

Figure 4-4 MaR Comparison for Classifiers



It is clear that the Naive Bayes classifier has the highest values on all three measures, with the KNN classifier in second place; the worst values on the three measures were for Rocchio. There is also a disproportion between the precision, recall, and f-measure values for k-NN, which reaches a high precision (0.83) but a very low recall (0.53). The precision, recall, and f-measure values for the other two classifiers, Rocchio and Naïve Bayes, are more stable.

Figure 4-5 Precision, Recall and F-Measure for the Three Classifiers

4.5. Results of the Naïve Bayes Algorithm (MMM) with 5070 Documents

Another experiment was conducted. The collected corpus, shown in table 4-10, contains 5070 documents that vary in length; these documents fall into six categories: Business, Entertainment, Middle East news, Sport, World news, and Science and Technology.



Table 4-10 Categories and Their Distributions in the Corpus (5070 Documents)

NO  Category                Number
1   Business                836
2   Entertainment           474
3   Middle East news        1462
4   Sport                   762
5   World news              1010
6   Science and Technology  526

Table 4-11 shows the confusion matrix for the Naïve Bayes using MMM algorithm. The lowest number of correctly classified documents is for the Entertainment category, where 400 documents are classified correctly and 74 documents are classified incorrectly.

Table 4-11 Confusion Matrix Results for NB Algorithm in the Corpus (5070 Documents)


Figure 4-6 shows recall, precision, and f-measure for every category when the Naive Bayes classifier was used. Precision reaches its highest value (0.991) for the Sport category, while the lowest precision value (0.746) is for the Entertainment category. Recall reaches its highest value (0.979) for the Sport category and its lowest value (0.832) for the Middle East news category. The F-measure reaches its highest value (0.985) for the Sport category and its lowest value (0.792) for the Entertainment category. The rest of the figure is self-explanatory.

Table 4-12 Confusion Matrix Results for NB Algorithm in the Corpus (5070 Documents)

Figure 4-6 plots the precision and recall for all the categories classified using Naïve Bayes with MMM.


Figure 4-6 Result of the Naive Bayes Classification Algorithm

Table 4-13 shows the weighted average of the above values over all categories for the NB algorithm; the overall f-measure is 0.884.

Table 4-13 NB Using MMM Classifier Weighted Average for the Six Categories in the Arabic Corpus (5070 Documents)

Naïve Bayes using MMM    Precision  Recall  F-measure
Weighted average         0.882      0.890   0.884

Comparing the overall results of tables 4-8 and 4-13 shows a slight degradation in precision, recall, and F-measure. This is because the testing still used 4-fold cross-validation; with a percentage split instead, the results would differ, since the classifier would learn from more data.



4.6. Summary

Naive Bayes using MMM outperformed the k-NN and Rocchio classifiers: the Naive Bayes (MMM) classifier achieved the best results, and the other techniques came after it.


5. Chapter Five: Conclusion and Future Work



5.1. Conclusion:

Text classification for the Arabic language has been investigated in this project. Three classifiers were compared: KNN, Rocchio, and Naive Bayes using a Multinomial Mixture Model (MMM).

Unclassified documents were pre-processed by removing stopwords and punctuation marks. The remaining words were stemmed and stored in feature vectors; every test document has its own feature vector. Finally, each document is classified to the best class according to the classifier technique.

The accuracy of the classifiers was measured using recall, precision, and F-measure. For the project experiments the classifiers were tested using 1445 documents. The results show that NB using the multinomial model outperformed the other two classifiers.

5.2. Future Work:

As future work, we plan to continue working on Arabic text categorization, as this area is not widely explored in the literature, and to try the classifiers on a larger collection.

Apply an auxiliary feature method with the multinomial model in order to improve classification accuracy.

Compare the Naïve Bayes MMM model with different models such as the multivariate Bernoulli [9].

Evaluate BPSO feature selection with the multinomial classifier using the same Arabic database mentioned in [35], then compare the two achieved results.


References

[1] Hasan, M.M.: 'Can Information Retrieval techniques meet automatic assessment challenges?' (2009), pp. 111-338

[2] Ghwanmeh, S., Kanaan, G., Al-Shalabi, R., and Ababneh, A.: 'Enhanced Arabic Information Retrieval System based on Arabic Text Classification' (2007), pp. 461-465

[3] Duwairi, R.: ‘Arabic Text Categorization’, International Arab Journal on Information

Technology, 2007, 4, (2)

[4] Ko, Y., Park, J., and Seo, J.: ‘Improving text categorization using the importance of sentences’, Information Processing & Management, 2004, 40, (3), pp. 65-79

[5] Ko, Y., and Seo, J.: ‘Text classification from unlabeled documents with bootstrapping and

feature projection techniques’, Information Processing & Management, 2009, 45, (3), pp. 70-83

[6] Chen, J., Huang, H., Tian, S., and Qu, Y.: ‘Feature selection for text classification with Naïve

Bayes’, Expert Systems with Applications, 2009, 36, (3, Part 1), pp. 5432-5435

[7] Mesleh, A.M., and Kanaan, G.: 'Support vector machine text classification system: Using Ant Colony Optimization based feature subset selection' (2008), pp. 341-348

[8] Duwairi, R.M.: 'Arabic Text Categorization', Int. Arab J. Inf. Technol., 2007, 4, (2), pp. 125-132

[9] Duwairi, R.M.: ‘Machine learning for Arabic text categorization’, Journal of the American Society for Information Science and Technology, 2006, 57, (8), pp. 1005-1010

[10] Abboud, P.F., and McCarus, E.N.: 'Elementary Modern Standard Arabic: Volume 3, Pronunciation and Writing; Lessons 1-10' (Cambridge University Press, 1983)

[11] Chen, A., and Gey, F.C.: 'Building an Arabic Stemmer for Information Retrieval' (2002)

[12] Zrigui, M., Ayadi, R., Mars, M., and Maraoui, M.: 'Arabic Text Classification Framework Based on Latent Dirichlet Allocation', Journal of Computing and Information Technology, 2012, 20, (2), pp. 125-140


[13] Deisy, C., Gowri, M., Baskar, S., Kalaiarasi, S., and Ramraj, N.: ‘A novel term weighting scheme MIDF for Text Categorization’, Journal of Engineering Science and Technology, 2010, 5, (1), pp. 94-107

[14] https://sites.google.com/site/motazsite/Home/osac, 2010

[15] Zrigui, M., Ayadi, R., Mars, M., and Maraoui, M.: 'Arabic Text Classification Framework Based on Latent Dirichlet Allocation', Journal of Computing and Information Technology, 2012, 20, (2), pp. 125-140

[16] Settles, B.: 'Active Learning Literature Survey', University of Wisconsin, Madison, 2010

[17] Noaman, H.M., Elmougy, S., Ghoneim, A., and Hamza, T.: 'Naive Bayes Classifier Based Arabic Document Categorization', Informatics and Systems (INFOS), 2010 The 7th International Conference on, IEEE, pp. 1-5

[18] Pang, B., Lee, L., and Vaithyanathan, S.: 'Thumbs up?: sentiment classification using machine learning techniques' (Association for Computational Linguistics, 2002), pp. 79-86

[19] Pang, B., and Lee, L.: ‘Opinion mining and sentiment analysis’, Foundations and trends in

information retrieval, 2008, 2, (1-2), pp. 1-135

[20] Turney, P.D.: 'Thumbs up or thumbs down?: semantic orientation applied to unsupervised classification of reviews' (Association for Computational Linguistics, 2002), pp. 417-424

[21] Gamon, M., and Aue, A.: 'Automatic identification of sentiment vocabulary: exploiting low association with known sentiment terms' (Association for Computational Linguistics, 2005), pp. 57-64

[22] Mesleh, A.M.d.: 'Feature sub-set selection metrics for Arabic text classification', Pattern Recognition Letters, 2011, 32, (14), pp. 1922-1929

[23] Zhang, J., Chen, L., and Guo, G.: ‘Projected-prototype based classifier for text categorization’, Knowledge-Based Systems, 2013, 49, (0), pp. 179-189

[24] Duwairi, R.: 'Arabic text categorization', The International Arab Journal of Information Technology, 2007, 4, (2)


[25] Kanaan, G., Al-Shalabi, R., Ghwanmeh, S., and Al-Ma'adeed, H.: 'A comparison of text-classification techniques applied to Arabic text', Journal of the American Society for Information Science and Technology, 2009, 60, (9), pp. 1836-1844

[26] El-Halees, A.: ‘Arabic text classification using maximum entropy’, The Islamic University

Journal (Series of Natural Studies and Engineering) Vol, 2007, 15, pp. 157-167

[27] Mesleh, A.M.: 'Chi square feature extraction based SVMs Arabic language text categorization system', Journal of Computer Science, 2007, 3, (6), pp. 430-435

[28] Al-Shalabi, R., Kanaan, G., and Gharaibeh, M.: ‘Arabic text categorization using kNN

algorithm’, Proc. 4th Internat. Multiconf. on Computer Science and Information Technology (CSIT 2006), 2006,

[29] Syiam, M.M., Fayed, Z.T., and Habib, M.: ‘An intelligent system for Arabic text

categorization’, International Journal of Intelligent Computing and Information Sciences, 2006, 6, (1), pp. 1-19

[30] Sawaf, H., Zaplo, J., and Ney, H.: ‘Statistical classification methods for Arabic news articles’,

Natural Language Processing in ACL2001, Toulouse, France, 2001

[31] Hmeidi, I., Hawashin, B., and El-Qawasmeh, E.: 'Performance of KNN and SVM classifiers on full word Arabic articles', Advanced Engineering Informatics, 2008, 22, (3), pp. 306-311

[32] Alsaleem, S.: ‘Automated Arabic Text Categorization Using SVM and NB’, Int. Arab J. e-Technol., 2011, 2, (2), pp. 124-128

[33] Al-diabat, M.: ‘Arabic Text Categorization Using Classification Rule Mining’, Applied

Mathematical Sciences, 2012, 6, (81), pp. 4033-4046

[34] Al-Kabi, M., Wahsheh, H., Alsmadi, I., Al-Shawakfa, E., Wahbeh, A., and Al-Hmoud, A.: ‘Content-based analysis to detect Arabic web spam’, Journal of Information Science, 2012, 38, (3), pp. 284-296

[35] Mitra, V., Wang, C.-J., and Banerjee, S.: ‘Text classification: A least square support vector

machine approach’, Applied Soft Computing, 2007, 7, (1), pp. 908-914