PPM based Spam Filtering in SEWM2008

Liu JuXin, Xu Congfu, Peng Peng, Lu Guanzhong
College of Computer Science, Zhejiang University
April 10, 2008


Page 1: PPM based Spam Filtering in SEWM2008

PPM based Spam Filtering in SEWM2008

Liu JuXin, Xu Congfu, Peng Peng, Lu Guanzhong

[email protected], [email protected], [email protected], [email protected]

College of Computer Science, Zhejiang University

April 10, 2008

Page 2: PPM based Spam Filtering in SEWM2008

Outline

PPM (prediction by partial matching)
Email Pre-processing
Train PPM Model
Model Classification

Page 3: PPM based Spam Filtering in SEWM2008

PPM

Data Compression

Page 4: PPM based Spam Filtering in SEWM2008

PPM Framework

Page 5: PPM based Spam Filtering in SEWM2008

Email Pre-processing

Source alphabet
Merge consecutive spaces
Truncate long messages

Page 6: PPM based Spam Filtering in SEWM2008

Email Pre-processing

Raw Data: Abcd_= - Af?/[]=+ safj =ab fe addfe

Sample:
Alphabet: {a,b,c,d,e,f,_,=, }
Replace char: ?
Truncate length: 20

After Replace: abcd_= ? Af????=? ?af? =ab fe addfe

After Merge Blank: abcd_= ? Af????=? ?af? =ab fe addfe

After Truncate: abcd_= ? Af????=? ?a
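The three pre-processing steps (replace, merge blanks, truncate) can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `preprocess`, its parameters, and the sample input are hypothetical. The input is written in lowercase with runs of spaces, since the transcript does not preserve the original spacing of the raw data.

```python
import re


def preprocess(text, alphabet, replace_char="?", max_len=20):
    """Map a raw message onto a restricted source alphabet.

    Hypothetical helper: the name and parameters are assumptions,
    not taken from the slides.
    """
    # Step 1: replace every character outside the source alphabet.
    replaced = "".join(ch if ch in alphabet else replace_char for ch in text)
    # Step 2: merge each run of consecutive spaces into a single space.
    merged = re.sub(r" {2,}", " ", replaced)
    # Step 3: truncate long messages to a fixed length.
    return merged[:max_len]


# Alphabet from the sample slide: a-f, underscore, equals sign, space.
sample_alphabet = set("abcdef_= ")
cleaned = preprocess("abcd_=  -  af?/[]=+   safj", sample_alphabet)
print(cleaned)
```

Running the steps in this order means the truncation length counts characters after blanks are merged, which matches the order of the slide's example.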

Page 7: PPM based Spam Filtering in SEWM2008

Train PPM Model

Use order-6 PPM* model
Use Method D escape estimation
Train two PPM models: a HAM model and a SPAM model
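To make the training step concrete, here is a minimal sketch of a character-level PPM model with Method D escape estimation, trained once per class. It is a simplified stand-in, not the authors' implementation. It uses a bounded context order rather than the unbounded contexts of PPM*, applies no symbol exclusion, and updates every context order on each symbol. The class name `SimplePPM`, the 256-symbol fallback alphabet, and the toy training strings are all assumptions.

```python
import math
from collections import defaultdict


class SimplePPM:
    """Simplified bounded-order PPM with Method D escapes (no exclusion)."""

    def __init__(self, order=6, alphabet_size=256):
        self.order = order
        self.alphabet_size = alphabet_size  # assumed size for the order -1 model
        # counts[context][symbol] = times symbol followed context in training
        self.counts = defaultdict(lambda: defaultdict(int))

    def train(self, text):
        for i, sym in enumerate(text):
            # Update all context lengths 0..order that fit before position i.
            for k in range(min(self.order, i) + 1):
                self.counts[text[i - k:i]][sym] += 1

    def symbol_bits(self, history, sym):
        """Code length in bits of sym given the preceding characters."""
        bits = 0.0
        for k in range(min(self.order, len(history)), -1, -1):
            seen = self.counts.get(history[len(history) - k:])
            if not seen:
                continue  # context never observed: shorten it at no cost
            n = sum(seen.values())
            c = seen.get(sym, 0)
            if c:
                # Method D: P(sym) = (2c - 1) / (2n)
                return bits - math.log2((2 * c - 1) / (2 * n))
            # Method D: P(escape) = (distinct symbols) / (2n)
            bits -= math.log2(len(seen) / (2 * n))
        # Order -1 fallback: uniform over the assumed alphabet.
        return bits + math.log2(self.alphabet_size)

    def cross_entropy(self, text):
        """Average bits per character of text under this model."""
        total = sum(
            self.symbol_bits(text[max(0, i - self.order):i], sym)
            for i, sym in enumerate(text)
        )
        return total / max(len(text), 1)


# Toy data (hypothetical): one model per class, as on the slide.
ham_model = SimplePPM(order=3)
ham_model.train("meeting at noon tomorrow " * 5)
spam_model = SimplePPM(order=3)
spam_model.train("buy cheap pills now " * 5)

msg = "cheap pills"
ham_bits = ham_model.cross_entropy(msg)
spam_bits = spam_model.cross_entropy(msg)
```

A message is cheaper to encode under the model whose training text it resembles, so here `spam_bits` comes out lower than `ham_bits`.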

Page 8: PPM based Spam Filtering in SEWM2008

Model Classification

MCE (Minimum Cross-entropy)
MDL (Minimum Description Length)
Spam Score
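The two criteria can be written out as follows. This is a sketch in notation commonly used for compression-model classifiers, not a transcription of the slide. Here $x = x_1 \dots x_n$ is the pre-processed message, $M_c$ is the model trained on class $c$, $k$ is the context order, and $M_c \oplus x_1^{i-1}$ is an assumed notation for the model after it has also been updated with the first $i-1$ characters of the message. The final score is one assumed way to combine the two class models into a single number. The slide does not give its definition.

```latex
% MCE: average code length under the static (unchanged) class model
H(x, M_c) = -\frac{1}{n} \sum_{i=1}^{n}
    \log_2 p_{M_c}\!\left(x_i \mid x_{i-k}^{i-1}\right),
\qquad
\hat{c}_{\mathrm{MCE}}(x) = \arg\min_{c \in \{\mathrm{ham},\,\mathrm{spam}\}} H(x, M_c)

% MDL: total code length when the model keeps adapting while reading x
L_c(x) = -\sum_{i=1}^{n}
    \log_2 p_{M_c \oplus x_1^{i-1}}\!\left(x_i \mid x_{i-k}^{i-1}\right),
\qquad
\hat{c}_{\mathrm{MDL}}(x) = \arg\min_{c \in \{\mathrm{ham},\,\mathrm{spam}\}} L_c(x)

% One assumed spam score: positive when the spam model encodes x more cheaply
\mathrm{score}(x) = H(x, M_{\mathrm{ham}}) - H(x, M_{\mathrm{spam}})
```

Under this assumed score, comparing against a threshold other than zero trades false positives against false negatives.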

Page 9: PPM based Spam Filtering in SEWM2008

Advantage

Simple pre-processing
No decoding needed (resists obfuscation)
Highly self-adaptive
Low false-positive rate

Page 10: PPM based Spam Filtering in SEWM2008

Reference

"Spam Filtering Using Statistical Data Compression Models"

"Unbounded Length Contexts for PPM"

Page 11: PPM based Spam Filtering in SEWM2008

Question

Delay Index ham, Ham and HAM Active learning 10000

Deliver the filter

Page 12: PPM based Spam Filtering in SEWM2008

Thanks for your attention!

Q&A