BlogVox: Separating Blog Wheat from Blog Chaff

Akshay Java, Pranam Kolari, Tim Finin, Anupam Joshi, Justin Martineau (UMBC), James Mayfield (JHU/APL)

Page 2:

Motivation: Cleaning the Harvest

• BlogVox: a blog analytics engine developed for the TREC 2006 Blog Track.
• The presence of spam blogs (splogs) and extraneous content waters down the quality of the index.
• Narrowing down to the content of the post is essential in the absence of clearly demarcated opinion sentences (as on Epinions, IMDB, Amazon, etc.).
• Noisy and unstructured text in the Blogosphere can skew blog analytics and business intelligence tools (as observed in TREC 2006).

Page 3:

BlogVox Opinion Extraction System

• TREC 06: Finding opinionated posts, either positive or negative, about a query
• 2006 TREC Blog corpus:
  • 80K blogs
  • 300K posts
  • 50 test queries
• BlogVox opinion extraction system
  • Document- and sentence-level scorers
  • Scores combined using an SVM meta-learner
  • Data cleaning: splogs and post identification

BlogVox challenges
• Data cleaning and splog removal
• Slang
• Semantic orientation of words
• Contradictions, sarcasm, ungrammatical text
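A rough illustration of how document-level and sentence-level scores might be combined by a linear meta-learner. This is a sketch under stated assumptions, not the BlogVox implementation: the feature names, the weights, and the bias are made-up placeholders (a real SVM would learn them from labeled posts), and the slides do not say which kernel the meta-learner used.

```python
# Hypothetical sketch: decision function of a linear meta-learner that
# combines the outputs of two scorers. WEIGHTS and BIAS are placeholders
# chosen for illustration; in practice an SVM would learn them from data.

WEIGHTS = {"doc_score": 1.5, "sent_score": 2.0}  # assumed, not from the talk
BIAS = -1.6                                      # assumed, not from the talk


def decision_value(scores):
    """Return w . x + b for one post's scorer outputs."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) + BIAS


def is_opinionated(scores):
    """A post is labeled opinionated when the decision value is positive."""
    return decision_value(scores) > 0


# Made-up scorer outputs for two posts.
post_a = {"doc_score": 0.8, "sent_score": 0.7}
post_b = {"doc_score": 0.3, "sent_score": 0.2}
print(is_opinionated(post_a), is_opinionated(post_b))
```

The point of the meta-learner is that neither scorer has to be trusted alone: the combination step decides how much each one counts.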

Page 4:

Separating Blog Wheat from Blog Chaff

Data cleaning for
• Splog removal
• Post content identification

Pre Indexing Steps
1. Collection Parsing
2. Non-English Blog removal
3. Splog Detection
4. Title and Content Extraction

Page 5:

Spam in the Blogosphere

• Types: comment spam, ping spam, splogs
• Akismet: “87% of all comments are spam”
• 75% of update pings are spam (ebiquity 2005)
• 56% of blogs are spam (ebiquity 2005)
• 20% of the blogs indexed by popular blog search engines are spam (Umbria 2006, ebiquity 2005)
• Spam blogs (splogs) are weblogs used to promote affiliated websites or host ads
• “Spings, or ping spam, are pings that are sent from spam blogs”

Page 6:

Motivation: host ads

Page 7:

Motivation: index affiliates, promote PageRank

Page 8:

Data Cleaning: Splogs

• Splog detection using SVM
• 700 blogs, 700 splogs used for training
• Model based on blog homepage and local blog features

Host ads
Index affiliates, promote PageRank
Plagiarized content

Splog Detection Performance

Page 9:

Nature of Splogs in TREC 2006

• Around 83K identifiable blog home-pages in the collection, with 3.2M permalinks
• 81K blogs could be processed
• We use splog detection models developed on blog home-pages; 87% accuracy
• We identified 13,542 splogs
• Blacklisted 543K permalinks from these splogs
• ~16% of the entire collection
• ~17% splog posts injected into TREC dataset [1]

[1] The TREC Blog06 Collection: Creating and Analyzing a Blog Test Collection, C. Macdonald, I. Ounis
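The blacklisting step above can be sketched as a filter over permalinks. This is a minimal sketch, not the authors' code: it assumes a blog is identified by the host name of its permalinks, and the URLs and function names are invented for illustration. The two ratios at the end are plain arithmetic on the figures quoted on the slide.

```python
from urllib.parse import urlparse


def blog_home(permalink):
    """Assumption: the host name of a permalink identifies its blog."""
    return urlparse(permalink).netloc.lower()


def split_by_blacklist(permalinks, splog_hosts):
    """Return (kept, blacklisted) permalinks given a set of splog host names."""
    kept, blacklisted = [], []
    for url in permalinks:
        target = blacklisted if blog_home(url) in splog_hosts else kept
        target.append(url)
    return kept, blacklisted


# Made-up URLs for illustration only.
permalinks = [
    "http://cooking.example.org/2006/01/soup.html",
    "http://cheap-loans.example.com/post/1",
    "http://cheap-loans.example.com/post/2",
]
kept, blacklisted = split_by_blacklist(permalinks, {"cheap-loans.example.com"})

# Ratios computed from the figures on the slide.
permalink_share = 543_000 / 3_200_000  # blacklisted permalinks / all permalinks
blog_share = 13_542 / 83_000           # detected splogs / identifiable home-pages
print(len(kept), len(blacklisted))
print(round(permalink_share * 100, 1), round(blog_share * 100, 1))
```

Both ratios fall in the 16 to 17% range that the slide reports.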

Page 10:

Impact of Splogs in TREC Queries

[Figure: Distribution of splogs that appear in TREC queries. X-axis: top 100 search results ranked using TFIDF scoring (5 to 100, in steps of 5). Y-axis: number of splogs (0 to 120). Legend: TREC queries 851 to 898. Labeled queries: American Idol, Cholesterol, Hybrid Cars.]
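The measurement behind a plot like this can be sketched as follows: rank documents for a query by a TF-IDF score, then count how many known splogs appear within each rank cutoff. This is a sketch under assumptions, not the authors' pipeline: the scoring formula (term frequency times log inverse document frequency), the cumulative counting, and the toy documents are all illustrative choices.

```python
import math
from collections import Counter


def tfidf_rank(query_terms, docs):
    """docs maps doc_id -> list of tokens. Returns doc ids, best score first."""
    n = len(docs)
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))  # document frequency of each term
    scores = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        scores[doc_id] = sum(
            tf[t] * math.log(n / df[t]) for t in query_terms if df[t] > 0
        )
    # Sort by descending score; break ties by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


def splogs_at_cutoffs(ranking, splog_ids, step=5, depth=100):
    """Count splogs within the top k results for k = step, 2*step, ..., depth."""
    return {
        k: sum(1 for d in ranking[:k] if d in splog_ids)
        for k in range(step, depth + 1, step)
    }


# Toy collection; ids and contents are invented.
docs = {
    "a": ["hybrid", "cars", "hybrid"],
    "b": ["hybrid", "cars"],
    "c": ["cholesterol"],
    "d": ["cars"],
}
ranking = tfidf_rank(["hybrid"], docs)
counts = splogs_at_cutoffs(ranking, {"a", "d"}, step=1, depth=4)
print(ranking, counts)
```

Running the same count once per query gives one curve per query, which is the shape of data the figure appears to show.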

Page 11:

Higher in Spam Prone Contexts

[Figure: Splog distribution for 'spam terms'. X-axis: search result rank (5 to 100, in steps of 5). Y-axis: number of splogs (0 to 120). Legend: series 1 to 28. Labeled terms: Card, Interest, Mortgage.]

Spam query terms based on analysis by Macdonald et al. 2006.

Page 12:

Separating Blog Wheat from Blog Chaff

Data cleaning for
• Splog removal
• Post content identification

Pre Indexing Steps
1. Collection Parsing
2. Non-English Blog removal
3. Splog Detection
4. Title and Content Extraction

Page 13:

Data Cleaning: Content Identification

Figure labels: Navigation, Post content, Ads, Recent Posts

Page 14:

Data cleaning: Baseline heuristic

Eliminate link a if there exists a link b such that:
• b is within θ distance of a
• There are no title tags between the links
• The average length of the text-bearing nodes is less than a threshold
• b is the nearest link to a

An example DOM tree (figure labels): Navigational Links, Ads, Post Content, Sidebar
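The baseline heuristic above can be sketched over a flattened DOM. This is a minimal sketch with several assumptions that the slide does not spell out: nodes are taken in document order, "distance" is the difference in node positions, the average text length is measured over the text nodes lying between the two links, and the names `Node`, `theta`, and `max_avg_text` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Node:
    kind: str       # "link", "title" or "text" (assumed node categories)
    text: str = ""


def extraneous_links(nodes, theta=3, max_avg_text=20):
    """Return positions of links that the heuristic would eliminate."""
    link_positions = [i for i, n in enumerate(nodes) if n.kind == "link"]
    eliminated = []
    for idx, a in enumerate(link_positions):
        neighbours = []
        if idx > 0:
            neighbours.append(link_positions[idx - 1])
        if idx + 1 < len(link_positions):
            neighbours.append(link_positions[idx + 1])
        if not neighbours:
            continue
        b = min(neighbours, key=lambda p: abs(p - a))  # nearest link to a
        lo, hi = sorted((a, b))
        if hi - lo > theta:                            # not within theta
            continue
        between = nodes[lo + 1:hi]
        if any(n.kind == "title" for n in between):    # a title separates them
            continue
        lengths = [len(n.text) for n in between if n.kind == "text"]
        avg = sum(lengths) / len(lengths) if lengths else 0.0
        if avg < max_avg_text:
            eliminated.append(a)
    return eliminated


# Invented page: three navigation links, then a titled post with one link.
nodes = [
    Node("link", "Home"),
    Node("link", "About"),
    Node("link", "Archive"),
    Node("title", "My trip to the coast"),
    Node("text", "We left early in the morning and drove for hours along the shore."),
    Node("link", "photos"),
]
print(extraneous_links(nodes))
```

In this toy page the three adjacent navigation links are flagged, while the link inside the post survives because a title node sits between it and its nearest neighbour.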

Page 15:

Data cleaning: SVM cleaner

• Random collection of 150 blog posts
• Human evaluation of 400 links, tagged as content or extraneous links
• We trained an SVM using a linear kernel in this analysis

DOM Features
• Tag Features
• Position Features
• Word Features

Evaluation
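To make the three feature groups concrete, here is one way a link could be turned into a feature vector for a linear classifier. The slide only names the groups; every detail below is an assumption made for illustration, including the function name, the tag and word vocabularies, and the use of relative position.

```python
def link_features(enclosing_tags, position, total_nodes, anchor_text,
                  tag_vocab=("li", "ul", "div", "p", "h1"),
                  word_vocab=("archive", "comments", "home", "read")):
    """Build a flat feature vector: tag indicators, relative position, word indicators.

    tag_vocab and word_vocab are hypothetical vocabularies, not from the talk.
    """
    tags = {t.lower() for t in enclosing_tags}
    words = set(anchor_text.lower().split())
    tag_part = [1.0 if t in tags else 0.0 for t in tag_vocab]
    pos_part = [position / total_nodes if total_nodes else 0.0]
    word_part = [1.0 if w in words else 0.0 for w in word_vocab]
    return tag_part + pos_part + word_part


# A "Home" link inside a list near the top of a 50-node page.
features = link_features(["ul", "li"], 5, 50, "Home")
print(features)
```

Vectors like this, paired with the human content/extraneous labels, are the kind of input a linear-kernel SVM would be trained on.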

Page 16:

Data Cleaning: Effect of sidebar content

Page 17:

Related Work

• Web Spam Detection
  • Coverage: blog analytics engines don’t look beyond the Blogosphere
  • Speed of detection is important: 150K posts/hour
  • RSS feeds present new opportunities and challenges
• Email Spam Detection
  • Nature of spamming: links, RSS feeds, web graph, metadata
  • Users are targeted indirectly through search engines, e.g. “N1ST” is not relevant for a “NIST” query
• Template Detection
  • Repeated structural components detected via sampling
  • Customization and the use of JavaScript and AJAX are increasing
  • Simple heuristics using DOM traversal work well in general cases
• Sentiment Analysis
  • Open-domain opinion extraction is complex
  • Opinions are part of a narrative
  • The subject for which the opinion is being expressed is not easy to detect

Page 18:

Conclusions

• Noisy content on the Blogosphere presents a major challenge to the quality of blog analytics tools.
• A combination of heuristics and ML can be used to effectively clean the data.

Ongoing Work
• DOM subtree elimination
• Identifying the subject of the opinion
• Slang
• More training examples!

Page 19:

http://ebiquity.umbc.edu/

Thank you!

Page 20:

Backup Slides

Page 21:

Opinions in Social Media

“I went to school early so I would have time to grab some lunch. Which ended up consisting of a crappy sandwich from starbucks and a chai latte. Lacey came into Starbucks while I was there so we chatted for a little bit and she thought that I might be in her class. After I finished eating I headed to school and checked the board……..” [1]

[1] http://annamay13x.livejournal.com/7061.html

Expressed Opinions

Narrative
Reader’s Perspective

“Starbucks Sandwiches are bad!”

Opinions can influence buying decisions of customers

Page 22:

Keyword Stuffed Blog
• ‘coupon codes’, ‘casino’

Page 23:

Post Stitching
• Excerpts scraped from other sources

Page 24:

Post Weaving
• Spam links contextually placed in post

Page 25:

Link-roll spam
• With fully plagiarized text

Page 26:

Difficulty

• We have been experimenting with multiple approaches since mid-2005
• Data: http://ebiquity.umbc.edu/resource/html/id/212

Page 27:

Difficulty

• Evolving spamming techniques and splog creation genres
• Most basic spam techniques
  • Generate content by stuffing key dictionary words
  • Generate links to affiliates through link dumps on blogrolls, linkrolls or after post content
• Evolving spam techniques
  • Scrape contextually similar content to generate posts
  • RSS hijacking
  • Aggregation software, e.g. Planet X
  • Intersperse links randomly
  • Make link placement meaningful
  • Add spam comments and then ping. Repeat.

Page 28:

TREC Submissions (Topic Relevance)

Page 29:

TREC Submissions (Opinion Extraction)