An Experimental Comparison of Naive Bayesian and Keyword-Based Anti-Spam Filtering with Personal E-mail Messages
Authors: Ion Androutsopoulos, John Koutsias, Konstantinos V. Chandrinos, Constantine D. Spyropoulos
Source: SIGIR 2000
Outline: Introduction; Feature selection; The Naive Bayesian classifier; Result
Introduction
Spam e-mail is abundant.
This work compares the Naive Bayesian classifier with keyword-based anti-spam filtering.
Sahami et al. trained a Naive Bayesian classifier on manually categorized legitimate and spam messages.
The Naive Bayesian classifier
Each message is represented by a vector $\vec{x} = (x_1, x_2, x_3, \ldots, x_n)$, where $x_1, \ldots, x_n$ are the values of attributes $X_1, \ldots, X_n$.
Each attribute shows whether or not a particular word (e.g. "adult") is present in the message.
Additional attributes can correspond to phrases (e.g. "be over 21").
Further attributes can capture non-textual properties (e.g. whether or not the message contains attachments).
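As a rough illustration of the binary word attributes described above, the sketch below maps a message to a 0/1 vector. The function name `to_attribute_vector`, the tokenization rule, and the three-word vocabulary are all assumptions made for the example; phrase and non-textual attributes are left out.

```python
import re

def to_attribute_vector(message, vocabulary):
    """Return a 0/1 list: 1 if the vocabulary word occurs in the message."""
    # Hypothetical tokenizer: lowercase runs of letters, digits and "$".
    tokens = set(re.findall(r"[a-z0-9$]+", message.lower()))
    return [1 if word in tokens else 0 for word in vocabulary]

# Toy vocabulary chosen only for illustration.
vocabulary = ["adult", "money", "meeting"]
x = to_attribute_vector("Adult content, make money fast", vocabulary)
print(x)  # [1, 1, 0]
```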
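Mutual information
Use mutual information (MI) to select candidate attributes. For a binary attribute $X$ and the category variable $C$:
$$MI(X;C) = \sum_{x \in \{0,1\},\; c \in \{spam,\, legitimate\}} P(X=x, C=c) \cdot \log \frac{P(X=x, C=c)}{P(X=x) \cdot P(C=c)}$$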
Then select the attributes with the highest mutual
information values.
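The selection step can be sketched as follows, assuming the standard definition of mutual information between a binary attribute and the category, with probabilities estimated as relative frequencies in the training data. The function names, the natural logarithm, and the toy data are choices made for this example only.

```python
import math

def mutual_information(values, labels):
    """MI(X;C) for one binary attribute; values and labels are parallel lists."""
    n = len(values)
    mi = 0.0
    for x in (0, 1):
        for c in set(labels):
            joint = sum(1 for v, l in zip(values, labels) if v == x and l == c) / n
            if joint == 0:
                continue  # a zero joint probability contributes nothing
            p_x = sum(1 for v in values if v == x) / n
            p_c = sum(1 for l in labels if l == c) / n
            mi += joint * math.log(joint / (p_x * p_c))
    return mi

def select_attributes(vectors, labels, k):
    """Return the indices of the k attributes with the highest MI."""
    n_attrs = len(vectors[0])
    scores = [mutual_information([v[i] for v in vectors], labels)
              for i in range(n_attrs)]
    return sorted(range(n_attrs), key=lambda i: scores[i], reverse=True)[:k]

# Toy data: attribute 0 separates the classes, attribute 1 is uninformative.
vectors = [[1, 0], [1, 1], [0, 0], [0, 1]]
labels = ["spam", "spam", "legitimate", "legitimate"]
print(select_attributes(vectors, labels, 1))  # [0]
```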
The Naive Bayesian classifier
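L->S (legitimate classified as spam) and S->L (spam classified as legitimate) denote the two error types.
We assume that L->S is $\lambda$ times more costly than S->L.
Classify a message as spam if the following classification criterion is met:
$$\frac{P(C = spam \mid \vec{X} = \vec{x})}{P(C = legitimate \mid \vec{X} = \vec{x})} > \lambda$$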
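A minimal sketch of a cost-sensitive Naive Bayesian decision over binary attributes is given below. It assumes the criterion is that the posterior odds of spam exceed the cost factor (called `lam` here), and it uses Laplace smoothing, which is an addition for this example and not something the slides state. The class labels `"spam"` and `"legitimate"` and all function names are hypothetical.

```python
import math

def train(vectors, labels):
    """Estimate class priors and per-attribute P(X_i = 1 | class)."""
    prior, cond = {}, {}
    n_attrs = len(vectors[0])
    for c in set(labels):
        rows = [v for v, l in zip(vectors, labels) if l == c]
        prior[c] = len(rows) / len(vectors)
        # Laplace smoothing (assumption) keeps every probability strictly in (0, 1).
        cond[c] = [(sum(r[i] for r in rows) + 1) / (len(rows) + 2)
                   for i in range(n_attrs)]
    return prior, cond

def posterior_spam(x, prior, cond):
    """P(spam | x) under the attribute-independence assumption."""
    log_score = {}
    for c in prior:
        s = math.log(prior[c])
        for xi, p in zip(x, cond[c]):
            s += math.log(p if xi == 1 else 1 - p)
        log_score[c] = s
    m = max(log_score.values())  # subtract the maximum for numerical stability
    total = sum(math.exp(s - m) for s in log_score.values())
    return math.exp(log_score["spam"] - m) / total

def is_spam(x, prior, cond, lam):
    """True if P(spam | x) / P(legitimate | x) > lam."""
    p = posterior_spam(x, prior, cond)
    return p > lam * (1 - p)  # same test as the ratio, without dividing by zero

# Toy training set, for illustration only.
vectors = [[1, 0], [1, 0], [0, 1], [0, 1]]
labels = ["spam", "spam", "legitimate", "legitimate"]
prior, cond = train(vectors, labels)
print(is_spam([1, 0], prior, cond, lam=1))    # True
print(is_spam([1, 0], prior, cond, lam=999))  # False
```

On this toy data the posterior for `[1, 0]` is 0.9, so the message is blocked at a cost factor of 1 but passes at 999, which illustrates how a larger cost factor makes the filter more conservative.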
$\lambda = 999$ ($t = 0.999$): mistakenly blocking a legitimate message is taken to be as bad as letting 999 spam messages pass the filter.
$\lambda = 9$ ($t = 0.9$): when a message is blocked, an apology message and a riddle are returned to the sender.
$\lambda = 1$ ($t = 0.5$): appropriate if the recipient does not care about the extra work imposed on the sender.
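The three thresholds $t$ listed above are consistent with a criterion in which the posterior odds of spam must exceed the cost factor $\lambda$. Under that assumption, and writing $p = P(C = spam \mid \vec{x})$ so that $P(C = legitimate \mid \vec{x}) = 1 - p$, the relation between $\lambda$ and $t$ can be worked out as:

$$
\frac{p}{1 - p} > \lambda
\;\Longleftrightarrow\;
p > \lambda (1 - p)
\;\Longleftrightarrow\;
p\,(1 + \lambda) > \lambda
\;\Longleftrightarrow\;
p > t, \qquad t = \frac{\lambda}{1 + \lambda}
$$

Substituting the three cost factors gives $999/1000 = 0.999$, $9/10 = 0.9$ and $1/2 = 0.5$, which matches the values above.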
Result
1789 messages, consisting of 211 legitimate messages that users had saved and 1578 spam messages.
First experiment: only word attributes were used.
Second experiment: candidate attributes corresponding to phrases were added (e.g. "be over 21", "only $").
Third experiment: non-textual candidate attributes were added (e.g. whether or not the message contains attachments, or a high proportion of non-alphanumeric characters).
Experiments with the PU1 corpus
481 spam messages and 618 legitimate messages.
Naive Bayesian classifier, with ten-fold cross-validation to reduce random variation; results were then averaged over the ten runs.
The number of retained attributes was varied from 50 to 700 in steps of 50.
Lemmatizer and stop-list.
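The evaluation protocol above (ten folds, scores averaged over the ten runs, attribute counts from 50 to 700 in steps of 50) might be organized roughly as in the sketch below. The splitting is a plain shuffled partition, not necessarily how the original folds were built, and `evaluate` is a hypothetical callback that would train on one index list and return a score on the other.

```python
import random

def ten_fold_splits(n_messages, n_folds=10, seed=0):
    """Yield (train_indices, test_indices) for each fold of a shuffled partition."""
    indices = list(range(n_messages))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test

def cross_validate(n_messages, evaluate, n_folds=10):
    """Average the score returned by evaluate(train, test) over all folds."""
    scores = [evaluate(train, test)
              for train, test in ten_fold_splits(n_messages, n_folds)]
    return sum(scores) / len(scores)

# Attribute counts to sweep: 50, 100, ..., 700.
attribute_counts = list(range(50, 701, 50))

# 481 spam + 618 legitimate messages.
n_messages = 481 + 618
for train, test in ten_fold_splits(n_messages):
    assert len(train) + len(test) == n_messages
```

In a full experiment, `cross_validate` would be called once per entry in `attribute_counts`, with `evaluate` selecting that many attributes on the training part only.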