Detecting Spam Blogs: An Adaptive Online Approach Pranam Kolari Ph.D. Defense, Sept 25, 2007.


Page 1: Detecting Spam Blogs: An Adaptive Online Approach Pranam Kolari Ph.D. Defense, Sept 25, 2007.

Detecting Spam Blogs:

An Adaptive Online Approach

Pranam Kolari, Ph.D. Defense, Sept 25, 2007

Page 2: Detecting Spam Blogs: An Adaptive Online Approach Pranam Kolari Ph.D. Defense, Sept 25, 2007.

THESIS STATEMENT

It is possible to develop an effective, efficient and adaptive system to detect spam blogs.

Page 3: Detecting Spam Blogs: An Adaptive Online Approach Pranam Kolari Ph.D. Defense, Sept 25, 2007.

CONTRIBUTIONS

(i) a principled study of the characteristics of the problem,

(ii) a well motivated feature discovery effort,

(iii) a cost-sensitive, real-time filtering implementation, and

(iv) an ensemble driven classifier co-evolution.

Page 4: Detecting Spam Blogs: An Adaptive Online Approach Pranam Kolari Ph.D. Defense, Sept 25, 2007.

• Introduction

• Characterization

• Feature Discovery

• Cost-aware pipeline

• Adaptive Classifiers

• Evaluation

• Conclusions

• Future Directions

OUTLINE

Page 5: Detecting Spam Blogs: An Adaptive Online Approach Pranam Kolari Ph.D. Defense, Sept 25, 2007.

WHAT IS SPAM?

• “Unsolicited usually commercial e-mail sent to a large number of addresses” – Merriam Webster Online

• As the Internet has supported new applications, many other forms are common, requiring a much broader definition

Capturing user attention unjustifiably in Internet-enabled applications (e-mail, Web, social media, etc.)

Page 6: Detecting Spam Blogs: An Adaptive Online Approach Pranam Kolari Ph.D. Defense, Sept 25, 2007.

SPAM TAXONOMY

[Figure: taxonomy of Internet spam, organized by forms (direct, indirect) and mechanisms. Node labels: E-Mail Spam, IM Spam (SPIM), Spamdexing, General Web Spam, Spam Blogs (Splogs), Social Media Spam, Social Network Spam, Comment Spam, Bookmark Spam.]

Page 7: Detecting Spam Blogs: An Adaptive Online Approach Pranam Kolari Ph.D. Defense, Sept 25, 2007.

SPAMDEXING

[Figure: spamdexing workflow in three steps, (i) to (iii). Labels: Spam pages, Spam Blogs, Spam Comments, Guestbook Spam, Wiki Spam; in-links; spamdex; Search Engines; SERP; Spam pages, Spam Blogs [DOORWAY]; JavaScript Redirect; Spammer-owned domains; ads/affiliate links; arbitrage; Affiliate Programs; Context Ads; Affiliate Program Buyers.]

Page 8: Detecting Spam Blogs: An Adaptive Online Approach Pranam Kolari Ph.D. Defense, Sept 25, 2007.

SPAM BLOG

Auto-generated and/or Plagiarized Content

Advertisements in Profitable Contexts

Link Farms to promote other spam pages

Page 9: Detecting Spam Blogs: An Adaptive Online Approach Pranam Kolari Ph.D. Defense, Sept 25, 2007.

• Introduction

• Characterization

• Feature Discovery

• Cost-aware pipeline

• Adaptive Classifiers

• Evaluation

• Conclusions

• Future Directions

OUTLINE

Page 10: Detecting Spam Blogs: An Adaptive Online Approach Pranam Kolari Ph.D. Defense, Sept 25, 2007.

CONTRIBUTIONS

(i) a principled study of the characteristics of the problem,

(ii) a well motivated feature discovery effort,

(iii) a cost-sensitive, real-time filtering implementation, and

(iv) an ensemble driven classifier co-evolution.

Page 11: Detecting Spam Blogs: An Adaptive Online Approach Pranam Kolari Ph.D. Defense, Sept 25, 2007.

CHARACTERIZATION

• WordNet defines characterize as “to describe or portray the characters or the qualities or peculiarities”

• Our efforts
  – Define and scope the problem
  – Field study
  – Principled empirical analysis
  – Publicize and solicit feedback

Page 12: Detecting Spam Blogs: An Adaptive Online Approach Pranam Kolari Ph.D. Defense, Sept 25, 2007.

SCOPE

[Figure: numbered steps 1 to 3. Labels: Update Pings, Ping Stream, Fetch Content (step 3).]

Splog filtering happens between steps 2 and 3 (pre-indexing) and is used by the blog harvester.

Page 13: Detecting Spam Blogs: An Adaptive Online Approach Pranam Kolari Ph.D. Defense, Sept 25, 2007.

BLOGS & SPAMDEXING

• Bias of search engines to blogs
  – through quick indexing (ping servers)
  – and higher relevance (temporal)

• Availability of third party blogging platforms
  – providing service for free
  – supporting programmatic content injection
  – enjoying high authority and trust (e.g. blogspot)
  – enabling obfuscation (doorways) to search engines and DMCA notices

Page 14: Detecting Spam Blogs: An Adaptive Online Approach Pranam Kolari Ph.D. Defense, Sept 25, 2007.

SPLOGS BY NUMBERS

56% of all active blogs are splogs! (2007)

• 75% of update pings (eBiquity 2006)
• 20% of indexed Blogosphere (Umbria 2006)
• 56% of update pings (eBiquity 2007)

Page 15: Detecting Spam Blogs: An Adaptive Online Approach Pranam Kolari Ph.D. Defense, Sept 25, 2007.

SPLOG DETECTION PROBLEM

• Given a blog, is it authentic or spam?

• Explore evidence space
  – Contents of the blogs (local attributes)
  – Evidence from neighbors (global attributes)

P(splog(x) | O(x))

P(splog(x) | L(x))

Page 16: Detecting Spam Blogs: An Adaptive Online Approach Pranam Kolari Ph.D. Defense, Sept 25, 2007.

EXISTING CONTEXTS

NATURE
[Figure; axis labels: time, time/posts]

ATTACKS
• E-mail: image spam, character salad
• Web: scripts, doorways
• Blogs: scripts, doorways, temporal deception

WHO USES IT?
• E-mail: users, e-mail service provider
• Web: search engines, page hosting services (e.g. Tripod)
• Blogs: Web search engines, blog search engines, blog hosting services, (ping servers)

CONSTRAINTS
• E-mail: fast detection, low overhead, online
• Web: batch detection, mostly offline
• Blogs: fast detection, low overhead

Page 17: Detecting Spam Blogs: An Adaptive Online Approach Pranam Kolari Ph.D. Defense, Sept 25, 2007.

RELATED WORK – WEB SPAM

• Local Content (Drost et al., 2005)
  – using TFIDF word-features, specialized features etc.

• Statistical Properties (Fetterly et al., 2004)
  – using page updates, identical pages through page-stitching

• Trust-Rank (Gyöngyi et al., 2004)
  – As an extension to Page-Rank

• Splog Detection (Salvetti et al., Lin et al.)

Page 18: Detecting Spam Blogs: An Adaptive Online Approach Pranam Kolari Ph.D. Defense, Sept 25, 2007.

• Introduction

• Characterization

• Feature Discovery

• Cost-aware pipeline

• Adaptive Classifiers

• Evaluation

• Conclusions

• Future Directions

OUTLINE

Page 19: Detecting Spam Blogs: An Adaptive Online Approach Pranam Kolari Ph.D. Defense, Sept 25, 2007.

CONTRIBUTIONS

(i) a principled study of the characteristics of the problem,

(ii) a well motivated feature discovery effort,

(iii) a cost-sensitive, real-time filtering implementation, and

(iv) an ensemble driven classifier co-evolution.

Page 20: Detecting Spam Blogs: An Adaptive Online Approach Pranam Kolari Ph.D. Defense, Sept 25, 2007.

MACHINE LEARNING CLASSIFICATION

• Documents as vectors in a feature space: f1, f2, f3 .. fm

• Feature Space
  – Discovery
  – Representation
  – Selection

• Classification Techniques
  – Support Vector Machines (Discriminative)
  – Naïve Bayes Classifier (Generative)

• Tools (libsvm, weka)
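As an illustration of the generative option named on this slide, the following is a minimal sketch of a Bernoulli Naïve Bayes classifier over binary features. It is not the thesis implementation (the slides mention libsvm and weka as the tools); the function names, the Laplace smoothing and the toy data are assumptions made for the example.

```python
import math
from collections import Counter


def train_bernoulli_nb(docs, labels):
    """docs: list of feature sets; labels: parallel list of class names."""
    vocab = set().union(*docs)
    class_count = Counter(labels)
    feat_count = {c: Counter() for c in class_count}
    for features, label in zip(docs, labels):
        feat_count[label].update(features)
    return vocab, class_count, feat_count


def predict_proba(model, features, positive="splog"):
    """Return P(positive | features) under the Bernoulli model."""
    vocab, class_count, feat_count = model
    total = sum(class_count.values())
    log_scores = {}
    for c, n_c in class_count.items():
        score = math.log(n_c / total)  # class prior
        for f in vocab:
            # Laplace-smoothed estimate of P(feature present | class)
            p = (feat_count[c][f] + 1) / (n_c + 2)
            score += math.log(p if f in features else 1 - p)
        log_scores[c] = score
    # Normalize in a numerically stable way
    top = max(log_scores.values())
    exp = {c: math.exp(s - top) for c, s in log_scores.items()}
    return exp[positive] / sum(exp.values())
```

With a handful of toy documents, a feature set containing a term seen only in splogs scores above 0.5 and one containing a term seen only in blogs scores below it.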

Page 21: Detecting Spam Blogs: An Adaptive Online Approach Pranam Kolari Ph.D. Defense, Sept 25, 2007.

MACHINE LEARNING EVALUATION

• Precision (P)
  – a measure of correctness of classified documents

• Recall (R)
  – a measure of completeness of classified documents

• F-1 = 2*P*R/(P+R)

• ROC AUC* – Area Under the Curve
  – a measure of discriminatory power

* Presented in Thesis Document
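The three count-based measures above can be computed directly from true and predicted labels. The sketch below assumes that "splog" is the positive class and that labels are plain strings; both are choices made for the example.

```python
def precision_recall_f1(y_true, y_pred, positive="splog"):
    """Return (precision, recall, F-1) for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```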

Page 22: Detecting Spam Blogs: An Adaptive Online Approach Pranam Kolari Ph.D. Defense, Sept 25, 2007.

DATASETS

• SPLOG-2005
  – Sampled Summer 2005 at Technorati
  – Labeled samples of 700 blogs and 700 splogs
  – Only blog home-pages

• SPLOG-2006
  – Sampled Oct 2006 at Weblogs.com
  – Labeled samples of 750 blogs and 750 splogs
  – Blog home-pages + feeds

Page 23: Detecting Spam Blogs: An Adaptive Online Approach Pranam Kolari Ph.D. Defense, Sept 25, 2007.

EXPERIMENTAL SETUP

• Binary feature encoding

• Top 50K selected using frequency count

• SVMs
  – Default parameters
  – Linear kernel

• No stemming or stop word elimination

• Naïve Bayes

• Ten-fold cross-validation
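The first two bullets of this setup (binary encoding, top features by frequency count) can be sketched as follows. The slide does not say whether frequency means total occurrences or the number of documents containing a feature; this sketch assumes document frequency, breaks ties alphabetically so the result is deterministic, and uses a tiny `k` where the slide uses 50K.

```python
from collections import Counter


def select_top_k(token_lists, k):
    """Pick the k features that occur in the most documents."""
    doc_freq = Counter()
    for tokens in token_lists:
        for tok in set(tokens):  # count each document once per feature
            doc_freq[tok] += 1
    ranked = sorted(doc_freq, key=lambda t: (-doc_freq[t], t))
    return ranked[:k]


def binary_encode(tokens, vocabulary):
    """1 if the feature is present in the document, else 0."""
    present = set(tokens)
    return [1 if term in present else 0 for term in vocabulary]
```

No stemming or stop word removal is applied, in line with the setup above.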

Page 24: Detecting Spam Blogs: An Adaptive Online Approach Pranam Kolari Ph.D. Defense, Sept 25, 2007.

URL

[Figure: panels for 2005 and 2006]

Page 25: Detecting Spam Blogs: An Adaptive Online Approach Pranam Kolari Ph.D. Defense, Sept 25, 2007.

URL

• 3-, 4- and 5-charactergrams from the URL
• Captures profitable contexts
• Highly effective at ping streams
• Supports an extremely low cost classifier

[Figure: panels for 2005 and 2006]
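A minimal sketch of the charactergram extraction described above: all substrings of length 3, 4 and 5 taken from the URL string. Lowercasing and keeping the scheme prefix are assumptions of this sketch, and the example URL is hypothetical.

```python
def char_ngrams(text, sizes=(3, 4, 5)):
    """Return the set of character n-grams of the given sizes."""
    text = text.lower()
    grams = set()
    for n in sizes:
        for i in range(len(text) - n + 1):
            grams.add(text[i:i + n])
    return grams


# Hypothetical splog-like URL; "loan" is one of the 4-grams it yields.
features = char_ngrams("http://cheap-loans.example.com/")
```

Because only the URL from the ping is needed, no page fetch is required, which is what makes a classifier on these features cheap.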

Page 26: Detecting Spam Blogs: An Adaptive Online Approach Pranam Kolari Ph.D. Defense, Sept 25, 2007.

WORDS

[Figure: panels for 2005 and 2006]

Page 27: Detecting Spam Blogs: An Adaptive Online Approach Pranam Kolari Ph.D. Defense, Sept 25, 2007.

WORDS

• Words (text) on a blog
• Previously effective in topic classification
• Captures profitable advertising contexts
• Interesting authentic genre observed

[Figure: panels for 2005 and 2006]

Page 28: Detecting Spam Blogs: An Adaptive Online Approach Pranam Kolari Ph.D. Defense, Sept 25, 2007.

WORDGRAMS

[Figure: panels for 2005 and 2006]

Page 29: Detecting Spam Blogs: An Adaptive Online Approach Pranam Kolari Ph.D. Defense, Sept 25, 2007.

WORDGRAMS

• Word-2-grams: 2 adjacent words
• Shallow NLP technique to tackle word salad
• Word salad less common in web spam (TFIDF)
• Number of word-x-gram features grows exponentially with x

[Figure: panels for 2005 and 2006]
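Word-2-grams can be sketched as pairs of adjacent words. The tokenization rule (lowercase runs of letters, digits and apostrophes) is an assumption of this example, not something the slides specify.

```python
import re


def word_ngrams(text, n=2):
    """Return the list of n adjacent-word features, in order."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
```

A word-salad page may contain each word of a fluent page, yet share few of its adjacent pairs, which is the intuition behind using these features.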

Page 30: Detecting Spam Blogs: An Adaptive Online Approach Pranam Kolari Ph.D. Defense, Sept 25, 2007.

CHARACTERGRAMS

[Figure: panels for 2005 and 2006]

Page 31: Detecting Spam Blogs: An Adaptive Online Approach Pranam Kolari Ph.D. Defense, Sept 25, 2007.

CHARACTERGRAMS

• 3-, 4- and 5-charactergrams from blog content
• Can capture character salad (e.g. p1lls)
• Feature selection important

[Figure: panels for 2005 and 2006]

Page 32: Detecting Spam Blogs: An Adaptive Online Approach Pranam Kolari Ph.D. Defense, Sept 25, 2007.

OUTLINKS

[Figure: panels for 2005 and 2006]

Page 33: Detecting Spam Blogs: An Adaptive Online Approach Pranam Kolari Ph.D. Defense, Sept 25, 2007.

OUTLINKS

• Out-links tokenized on non-alphabetic characters
• Similar to URL n-grams, likely more robust
• Novel feature space

[Figure: panels for 2005 and 2006]
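One way to read the out-link tokenization is sketched below using Python's standard `html.parser`: collect every `href` on an anchor tag and split it on runs of non-alphabetic characters. Treating every `href` as an out-link (including same-site links) and lowercasing are assumptions of this sketch; the class and function names are hypothetical.

```python
import re
from html.parser import HTMLParser


class OutlinkCollector(HTMLParser):
    """Collects the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def outlink_tokens(html):
    """Split each out-link on non-alphabetic characters."""
    parser = OutlinkCollector()
    parser.feed(html)
    parser.close()
    tokens = []
    for href in parser.hrefs:
        tokens.extend(t for t in re.split(r"[^a-z]+", href.lower()) if t)
    return tokens
```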

Page 34: Detecting Spam Blogs: An Adaptive Online Approach Pranam Kolari Ph.D. Defense, Sept 25, 2007.

ANCHORS

[Figure: panels for 2005 and 2006]

Page 35: Detecting Spam Blogs: An Adaptive Online Approach Pranam Kolari Ph.D. Defense, Sept 25, 2007.

ANCHORS

• Anchor text tokenized into words
• Subsumed by words, but obfuscation difficult
• Capture personalization of publishing template
• Novel feature space

[Figure: panels for 2005 and 2006]

Page 36: Detecting Spam Blogs: An Adaptive Online Approach Pranam Kolari Ph.D. Defense, Sept 25, 2007.

“Honestly, Do you think people who make $10k/month from adsense make blogs manually? Come on, they need to make them as fast as possible. Save Time = More Money! It's Common SENSE! How much money do you think you will save if you can increase your work pace by a hundred times? Think about it…”

“Discover The Amazing Stealth Traffic Secrets Insiders Use To Drive Thousands Of Targeted Visitors To Any Site They Desire!”

“Holy Grail Of Advertising...”

“Easily Dominate Any Market, Any Search Engine, Any Keyword.”

Splog software ?!

$ 197

Page 37: Detecting Spam Blogs: An Adaptive Online Approach Pranam Kolari Ph.D. Defense, Sept 25, 2007.

Capture HTML Stylistic Patterns in Authentic Blogs

Page 38: Detecting Spam Blogs: An Adaptive Online Approach Pranam Kolari Ph.D. Defense, Sept 25, 2007.

HTMLTAGS

[Figure: panels for 2005 and 2006]

Page 39: Detecting Spam Blogs: An Adaptive Online Approach Pranam Kolari Ph.D. Defense, Sept 25, 2007.

HTMLTAGS

• Use HTML tags: stylistic information
• Capture signatures of splog software
• Fully language independent
• Novel feature space

[Figure: panels for 2005 and 2006]
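A minimal sketch of tag-based features, assuming the feature is simply which opening tags occur and how often (the slides do not say whether attributes or tag order are also used). It relies only on the standard `html.parser` module; the names are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Records the name of every opening tag, ignoring all text."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tag_features(html):
    """Return a count of opening tags in the page."""
    parser = TagCollector()
    parser.feed(html)
    parser.close()
    return Counter(parser.tags)
```

Since the page text is discarded, two pages built from the same template produce the same features regardless of the language they are written in.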

Page 40: Detecting Spam Blogs: An Adaptive Online Approach Pranam Kolari Ph.D. Defense, Sept 25, 2007.

FEED BASED DETECTION

• Limitations of using only home-pages
  – No knowledge of blog lifetime
  – Classifiers less effective in early lifecycle

• Benefits of using feeds
  – Most recent posts, lifetime, metadata
  – Capture correlations across posts

• Limitations of using only feeds
  – Lose signatures in the publishing template

Page 41: Detecting Spam Blogs: An Adaptive Online Approach Pranam Kolari Ph.D. Defense, Sept 25, 2007.

FEED ITEM DISTRIBUTION

• Plot number of items in feeds (SPLOG-2006)
• Authentic blogs feature a normal distribution
• Splogs: many with just one post
• Knowledge of classifier effectiveness vs. lifetime

Page 42: Detecting Spam Blogs: An Adaptive Online Approach Pranam Kolari Ph.D. Defense, Sept 25, 2007.

FEED BASED DETECTION

• Disjoint feature spaces: Words, Tags
• Trained and tested with n (x-axis) posts
• Publishing template signatures important
• Tags much more effective in early lifecycle

Page 43: Detecting Spam Blogs: An Adaptive Online Approach Pranam Kolari Ph.D. Defense, Sept 25, 2007.

RELATED CLASSIFIERS

• Blog Identification
  – Competency requirement for blog harvesters
  – F-1 measure of 98%

• Relational Features
  – Less effective (high P, low R)
  – Short-lived blogs, lifetime dependent
  – Knowledge of Web-graph

• Derived Features
  – Less effective

Page 44: Detecting Spam Blogs: An Adaptive Online Approach Pranam Kolari Ph.D. Defense, Sept 25, 2007.

FEATURE SPACE OBSERVATIONS

• Cost based classifier bucketing

• Known Feature Spaces
  – Words continue to be effective
  – Word-grams against obfuscation

• Novel Feature Spaces
  – Out-links, Anchors capture useful signals
  – HTML Tags very effective, even early lifecycle

• Feature Space Exploration
  – Tags, JavaScript, Feed Classification

Page 45: Detecting Spam Blogs: An Adaptive Online Approach Pranam Kolari Ph.D. Defense, Sept 25, 2007.

• Introduction

• Characterization

• Feature Discovery

• Cost-aware pipeline

• Adaptive Classifiers

• Evaluation

• Conclusions

• Future Directions

OUTLINE

Page 46: Detecting Spam Blogs: An Adaptive Online Approach Pranam Kolari Ph.D. Defense, Sept 25, 2007.

CONTRIBUTIONS

(i) a principled study of the characteristics of the problem,

(ii) a well motivated feature discovery effort,

(iii) a cost-sensitive, real-time filtering implementation, and

(iv) an ensemble driven classifier co-evolution.

Page 47: Detecting Spam Blogs: An Adaptive Online Approach Pranam Kolari Ph.D. Defense, Sept 25, 2007.

META-PING SYSTEM

• Regular Expression Filtering (March 2005)

• List of Authentic Blogs (August 2005)

• Blog Home-page Classifier (December 2005)

• URL Classifier (October 2006)

• Feed Classifier (May 2007)

• Cost-Aware Pipeline Implementation (Jan 2007)

Page 48: Detecting Spam Blogs: An Adaptive Online Approach Pranam Kolari Ph.D. Defense, Sept 25, 2007.

META-PING SYSTEM

[Figure: pre-indexing sping filter applied to the ping stream, with components arranged along an "Increasing Cost" axis. Component labels: Ping Log, Language Identifier, Blog Identifier, Regular Expressions, Blacklists/Whitelists, IP Blacklists, URL Filters, Homepage Filters, Feed Filters, Authentic Blogs.]
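The idea of ordering filters by increasing cost can be sketched as a cascade that stops at the first stage confident enough to decide. This is an illustration only: the deployed system is described elsewhere in the slides as a multithreaded Java implementation, and the stage names, thresholds, whitelist entry and regular expression below are all hypothetical.

```python
import re
from typing import Callable, List, Optional, Tuple

# A stage is (name, scorer). The scorer returns an estimated splog
# probability in [0, 1], or None when the stage has no opinion.
Stage = Tuple[str, Callable[[str], Optional[float]]]


def run_pipeline(url: str, stages: List[Stage],
                 splog_at: float = 0.9,
                 blog_at: float = 0.1) -> Tuple[str, Optional[str]]:
    """Try stages from cheapest to costliest; stop at the first confident one."""
    for name, score in stages:
        p = score(url)
        if p is None:
            continue
        if p >= splog_at:
            return "splog", name
        if p <= blog_at:
            return "blog", name
    return "undecided", None


# Hypothetical stages, ordered by increasing cost.
WHITELIST = {"http://good.example.org/"}
SPAM_PATTERN = re.compile(r"cheap|pills|casino")


def whitelist_stage(url):
    return 0.0 if url in WHITELIST else None


def regex_stage(url):
    return 1.0 if SPAM_PATTERN.search(url) else None


def homepage_stage(url):
    # Placeholder for a costly classifier that would fetch and score the page.
    return 0.5


STAGES = [("whitelist", whitelist_stage),
          ("regex", regex_stage),
          ("homepage", homepage_stage)]
```

A ping that a cheap stage can settle never reaches the stages that need a page fetch, which is where the savings come from.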

Page 49: Detecting Spam Blogs: An Adaptive Online Approach Pranam Kolari Ph.D. Defense, Sept 25, 2007.

META-PING SYSTEM

• Static Design
  – Project specific thresholds
  – Classifiers in pipeline
  – Based on accrued domain knowledge

• Dynamic Possibilities
  – Classifier thresholds
  – Classifier use
  – Queuing analysis and Precision/Recall requirements

Page 50: Detecting Spam Blogs: An Adaptive Online Approach Pranam Kolari Ph.D. Defense, Sept 25, 2007.

• Introduction

• Characterization

• Feature Discovery

• Cost-aware pipeline

• Adaptive Classifiers

• Evaluation

• Conclusions

• Future Directions

OUTLINE

Page 51: Detecting Spam Blogs: An Adaptive Online Approach Pranam Kolari Ph.D. Defense, Sept 25, 2007.

CONTRIBUTIONS

(i) a principled study of the characteristics of the problem,

(ii) a well motivated feature discovery effort,

(iii) a cost-sensitive, real-time filtering implementation, and

(iv) an ensemble driven classifier co-evolution.

Page 52: Detecting Spam Blogs: An Adaptive Online Approach Pranam Kolari Ph.D. Defense, Sept 25, 2007.

ADAPTIVE CONTEXT

• Change in distribution in feature space

• Concept Drift – Seasonal, seen in both splogs and blogs

• Adversarial Scenario – seen in splogs

• Concept Description needs to be updated

f1, f2, f3 .. fm

P(O(x) | splog(x))

P(splog(x) | O(x))

Page 53: Detecting Spam Blogs: An Adaptive Online Approach Pranam Kolari Ph.D. Defense, Sept 25, 2007.

ENSEMBLE INTUITION

• Stream of unlabeled instances (drifting)

• Base classifiers with potentially independent feature spaces

• Is an ensemble (probabilistic committee) of the catalogue more robust to drift?

• Are instances classified by the ensemble effective to retrain base classifiers (semi-supervised learning)?

• Motivated by co-training
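The loop implied by these questions can be sketched as follows: a committee averages the probabilities of base classifiers that each look at a different feature space, confidently labeled instances from the unlabeled stream are kept, and every base classifier is retrained on them. `TokenCountClassifier` is a deliberately simple stand-in for the probabilistic SVMs used in the thesis, and the confidence threshold is an assumed parameter.

```python
from collections import Counter


class TokenCountClassifier:
    """Toy probabilistic classifier over one feature space ("view")."""

    def __init__(self, view):
        self.view = view
        self.spam = Counter()
        self.ham = Counter()

    def retrain(self, labeled):
        """Add labeled instances (label 1 = splog, 0 = blog)."""
        for instance, label in labeled:
            target = self.spam if label == 1 else self.ham
            target.update(instance[self.view])

    def predict_proba(self, instance):
        tokens = instance[self.view]
        s = sum(self.spam[t] for t in tokens) + 1  # add-one smoothing
        h = sum(self.ham[t] for t in tokens) + 1
        return s / (s + h)


def ensemble_probability(classifiers, instance):
    """Committee score: mean splog probability of the base classifiers."""
    scores = [clf.predict_proba(instance) for clf in classifiers]
    return sum(scores) / len(scores)


def co_evolve(classifiers, unlabeled, confident=0.8):
    """Label confident instances with the committee, then retrain all bases."""
    newly_labeled = []
    for instance in unlabeled:
        p = ensemble_probability(classifiers, instance)
        if p >= confident:
            newly_labeled.append((instance, 1))
        elif p <= 1 - confident:
            newly_labeled.append((instance, 0))
    for clf in classifiers:
        clf.retrain(newly_labeled)
    return newly_labeled
```

In this toy setting, when a splog's vocabulary drifts but its URL tokens do not, the URL classifier can carry the committee decision, and the word classifier then learns the new vocabulary from the committee's label.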

Page 54: Detecting Spam Blogs: An Adaptive Online Approach Pranam Kolari Ph.D. Defense, Sept 25, 2007.

ENSEMBLE INTUITION

[Figure: flow from unlabeled instances through the base classifiers and a probabilistic ensemble committee to updated classifiers. Arrow labels: classify, retrain.]

Page 55: Detecting Spam Blogs: An Adaptive Online Approach Pranam Kolari Ph.D. Defense, Sept 25, 2007.

ENSEMBLE APPROACH

[Figure: ensemble committee built from probabilistic base classifiers.]

Page 56: Detecting Spam Blogs: An Adaptive Online Approach Pranam Kolari Ph.D. Defense, Sept 25, 2007.

POTENTIAL TO ADAPT

URL

Anchor

Chargram

Outlink

Tag

Wordgrams

Words

Page 57: Detecting Spam Blogs: An Adaptive Online Approach Pranam Kolari Ph.D. Defense, Sept 25, 2007.

EXPERIMENTAL SETUP

• A catalog of seven classifiers

• SPLOG-2005 as base labeled dataset

• SPLOG-2006 as evaluation stream

• 10K Top Features

• SVM based learning

• SPLOG-2006 separated out into unlabeled stream and test set (3-fold)

• F-1 performance metric evaluation

Page 58: Detecting Spam Blogs: An Adaptive Online Approach Pranam Kolari Ph.D. Defense, Sept 25, 2007.

RESULTS – WORD DRIFT

Page 59: Detecting Spam Blogs: An Adaptive Online Approach Pranam Kolari Ph.D. Defense, Sept 25, 2007.

RESULTS – ALL CLASSIFIERS

Page 60: Detecting Spam Blogs: An Adaptive Online Approach Pranam Kolari Ph.D. Defense, Sept 25, 2007.

ENSEMBLE PROPERTIES

• Effectiveness tied to properties of ensemble

• Precision 92%, Recall 93%
  – 5 points over best base classifier

• Ensemble Diversity
  – a measure of disagreement between base classifiers
  – maintain error of base classifiers

Page 61: Detecting Spam Blogs: An Adaptive Online Approach Pranam Kolari Ph.D. Defense, Sept 25, 2007.

ENSEMBLE PROPERTIES

• Different metrics for diversity

• Q-statistic compares pairs of classifiers in an ensemble, [-1, +1]

• -1 most diverse, +1 least diverse

$$Q = \frac{N_{11} N_{00} - N_{01} N_{10}}{N_{11} N_{00} + N_{01} N_{10}}$$

$N_{11}$ – Both classify correctly

$N_{00}$ – Both misclassify

$N_{10}$ – Misclassification by 2nd

$N_{01}$ – Misclassification by 1st
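A short sketch of the Q-statistic for one pair of classifiers, computed from per-instance correctness flags. The function name and the input format (two parallel lists of booleans) are assumptions of this example; the counts follow the definitions on the slide.

```python
def q_statistic(correct_first, correct_second):
    """Q-statistic for two classifiers given per-instance correctness."""
    n11 = n00 = n10 = n01 = 0
    for a, b in zip(correct_first, correct_second):
        if a and b:
            n11 += 1  # both classify correctly
        elif not a and not b:
            n00 += 1  # both misclassify
        elif a:
            n10 += 1  # misclassification by the 2nd only
        else:
            n01 += 1  # misclassification by the 1st only
    denominator = n11 * n00 + n01 * n10
    if denominator == 0:
        raise ValueError("Q is undefined when both products are zero")
    return (n11 * n00 - n01 * n10) / denominator
```

Two classifiers that always err on the same instances give +1, and two that never err together give -1.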

Page 62: Detecting Spam Blogs: An Adaptive Online Approach Pranam Kolari Ph.D. Defense, Sept 25, 2007.

ENSEMBLE PROPERTIES

          cgram   wgram   word    tag     outlink  anchor  url
cgram      1.0     0.67    0.86   -0.23    0.35     0.58    0.08
wgram      0.67    1.0     0.77   -0.08    0.62     0.56    0.11
word       0.86    0.77    1.0    -0.19    0.53     0.76    0.04
tag       -0.23   -0.08   -0.19    1.0     0.15    -0.12    0.24
outlink    0.35    0.62    0.53    0.15    1.0      0.45    0.10
anchor     0.58    0.56    0.76   -0.12    0.45     1.0     0.03
url        0.08    0.11    0.04    0.24    0.10     0.03    1.0

Page 63: Detecting Spam Blogs: An Adaptive Online Approach Pranam Kolari Ph.D. Defense, Sept 25, 2007.

ADAPTIVE - OBSERVATIONS

• Interplay between co-training and adversarial classification

• Maintaining and exploiting (through an ensemble) a catalogue of features is effective

• Unlike existing work in concept drift– Real-world data– Stream of unlabeled instances

• Novel “feature spaces” key to adaptive, adversary resilient classifiers

Page 64: Detecting Spam Blogs: An Adaptive Online Approach Pranam Kolari Ph.D. Defense, Sept 25, 2007.

• Introduction

• Characterization

• Feature Discovery

• Adaptive Classifiers

• Cost-aware pipeline

• Evaluation

• Conclusions

• Future Directions

OUTLINE

Page 65: Detecting Spam Blogs: An Adaptive Online Approach Pranam Kolari Ph.D. Defense, Sept 25, 2007.

THESIS STATEMENT

It is possible to develop an effective, efficient and adaptive system to detect spam blogs.

Page 66: Detecting Spam Blogs: An Adaptive Online Approach Pranam Kolari Ph.D. Defense, Sept 25, 2007.

META-PING SYSTEM

[Figure: pre-indexing sping filter applied to the ping stream, with components arranged along an "Increasing Cost" axis. Component labels: Ping Log, Language Identifier, Blog Identifier, Regular Expressions, Blacklists/Whitelists, IP Blacklists, URL Filters, Homepage Filters, Feed Filters, Authentic Blogs.]

Page 67: Detecting Spam Blogs: An Adaptive Online Approach Pranam Kolari Ph.D. Defense, Sept 25, 2007.

META-PING IMPLEMENTATION

• Multithreaded, distributed Java implementation

• Regular Expressions, accrued over two years, tested using white-lists

• Blacklists - IP Address from known domain, learnt using higher cost classifiers

• libsvm toolkit for probabilistic classifiers

• Project specific classifier choices and thresholds

Page 68: Detecting Spam Blogs: An Adaptive Online Approach Pranam Kolari Ph.D. Defense, Sept 25, 2007.

META-PING EVALUATION

• Effective sub-modules
  – Evaluation of effective features
  – Harvard (Blog Identification, Word-based classifier), UMich (shared results)
  – ********, *******, LMCO

• Efficient solution
  – Pipeline deployment at UMBC (January 2007)
  – Ping filtering for two months (3 machines, 40 threads)

• Adaptive ready (offline)
  – Evaluation using real-world datasets collected a year apart

Page 69: Detecting Spam Blogs: An Adaptive Online Approach Pranam Kolari Ph.D. Defense, Sept 25, 2007.

• Introduction

• Characterization

• Feature Discovery

• Adaptive Classifiers

• Cost-aware pipeline

• Evaluation

• Conclusions

• Future Directions

OUTLINE

Page 70: Detecting Spam Blogs: An Adaptive Online Approach Pranam Kolari Ph.D. Defense, Sept 25, 2007.

THESIS STATEMENT

It is possible to develop an effective, efficient and adaptive system to detect spam blogs.

Page 71: Detecting Spam Blogs: An Adaptive Online Approach Pranam Kolari Ph.D. Defense, Sept 25, 2007.

CONCLUSIONS

• Characterizing the Problem of Spam Blogs
  – Helps drive solutions
  – Readies tackling new emerging problems (e.g. Social Media spam)

• Feature Spaces effective for text classification are also useful here

• New feature spaces are quite effective, and could potentially be useful in other domains

Page 72: Detecting Spam Blogs: An Adaptive Online Approach Pranam Kolari Ph.D. Defense, Sept 25, 2007.

CONCLUSIONS

• Using classifier costs to drive a pipeline based implementation can lead to an efficient filtering solution

• Semi-supervised ensemble approach can enable adaptive classifiers
  – Could be useful in (adversarial) domains that use a catalogue of classifiers
  – Proactive techniques are feasible for web spam detection

Page 73: Detecting Spam Blogs: An Adaptive Online Approach Pranam Kolari Ph.D. Defense, Sept 25, 2007.

• Introduction

• Characterization

• Feature Discovery

• Adaptive Classifiers

• Cost-aware pipeline

• Evaluation

• Conclusions

• Future Directions

OUTLINE

Page 74: Detecting Spam Blogs: An Adaptive Online Approach Pranam Kolari Ph.D. Defense, Sept 25, 2007.

SOCIAL MEDIA SPAM

[Figure: taxonomy of Internet spam, organized by forms (direct, indirect) and mechanisms. Node labels: E-Mail Spam, IM Spam (SPIM), Spamdexing, General Web Spam, Spam Blogs (Splogs), Social Media Spam, Social Network Spam, Comment Spam, Bookmark Spam.]

Page 75: Detecting Spam Blogs: An Adaptive Online Approach Pranam Kolari Ph.D. Defense, Sept 25, 2007.

SOCIAL MEDIA SPAM

• Spam in social “microcosms” on the Web

• Spam on the Web
  – Spamdexing
  – Social Media Spam

• Social Media Spam serves two purposes
  – Local effects initially
  – Global effects subsequently (spamdexing)

• Detection efforts should address deployment contexts (microcosm, search)

Page 76: Detecting Spam Blogs: An Adaptive Online Approach Pranam Kolari Ph.D. Defense, Sept 25, 2007.

OPEN PROBLEMS

• Feature Sophistication in new feature spaces, HTML Tags, JavaScript, Feeds

• Cost-aware pipeline (dynamic)

• Adversarial Classification, interplay with concept drift, catalog of features

• Active Learning and Adversarial Classification in the “catalogue” context

• Social Media Spam

Page 77: Detecting Spam Blogs: An Adaptive Online Approach Pranam Kolari Ph.D. Defense, Sept 25, 2007.

THANK YOU!