7/31/2019 Sadia Masood 2
1/53
Web Spam
Seminar: Future Of Web Search
Know Your Neighbors: Web Spam
Detection using the Web Topology
Presenter: Sadia Masood
Tutor : Klaus Berberich
Date : 17-Jan-2008
University of Saarland
7/31/2019 Sadia Masood 2
2/53
Web Spam
The Agenda
Focus of the First Paper
Motivation
The Objective
DATASET
Link Based Features
Content Based Features
Using the Content & Link Based Features
Using the Web Topology
Conclusion
2
7/31/2019 Sadia Masood 2
3/53
Web Spam
The Agenda
Focus of the First Paper
Motivation
The Objective
DATASET
Link Based Features
Content Based Features
Using the Content & Link Based Features
Using the Web Topology
Conclusion
3
7/31/2019 Sadia Masood 2
4/53
7/31/2019 Sadia Masood 2
5/53
What is Web Spamming?
Any deliberate human action that is meant to trigger an
unjustifiably favorable relevance or importance of some web page
considering the pages true value [5]
Also called Search Engine SpammingorSpamdexing
[5]: Z. Gyongyi and H. Garcia-Molina. Web spam taxonomy. In First International Workshop on
Adversarial Information Retrieval on the Web, 2005.5
7/31/2019 Sadia Masood 2
6/53
What is Web Spamming?
Example
Query: Kaiser Pharmacy (at the time paper was written 2005)
Result: had techdictionary.com as a 3rd hit
6
7/31/2019 Sadia Masood 2
7/53
Why is Web Spamming Bad?
Search Engines suffer because
It damages the search engines reputation
there is a the cost involved to crawling, indexing, storing the spam
pages.
Users suffer because
precision in the query results is lowered
7
7/31/2019 Sadia Masood 2
8/53
How is it done?
Web Spamming Techniques involve
Boosting Techniques
Content spam (e.g, Term repetition)
Link spam (e.g Link farming)
Hiding Techniques
For example, Content hiding, Cloaking
8
7/31/2019 Sadia Masood 2
9/53
What do the Search Engines Ultimately want ?
Want to calculate and return the exact page ranks based on
relevance and importance
Want to avoid the spam pages altogether before they use
resources that might be used in storing or processing those webpages
9
7/31/2019 Sadia Masood 2
10/53
Web Spam
The Agenda
Focus of the First Paper
Motivation
The Objective
DATASET
Link Based Features
Content Based Features
Using the Content & Link Based Features
Using the Web Topology
Conclusion
10
7/31/2019 Sadia Masood 2
11/53
7/31/2019 Sadia Masood 2
12/53
The Flow
Feature extraction
Combining link & content based features
Link based
Content based
Classification
Clustering Propagation Stacked graphical learning
Smoothing
12
7/31/2019 Sadia Masood 2
13/53
Web Spam
The Agenda
Focus of the First Paper
Motivation
The Objective
DATASET
Link Based Features
Content Based Features
Using the Content & Link Based Features
Using the Web Topology
Conclusion
13
7/31/2019 Sadia Masood 2
14/53
Web Spam
Data Set
Data Set
Collected in May 2006
Publicly available WEBSPAM-UK2006 dataset [15]
Pages from .uk domain
77.9 million pages were collected corresponding to 11,400 hosts
A group of volunteers labeled them normal, borderline or spam
6,552 hosts evaluated
[15] http://www.yr-bcn.es/webspam/datasets/14
7/31/2019 Sadia Masood 2
15/53
Data Set
Distribution of host labels, as judged by human volunteers
Evaluation
True positive rate, or Recall R False positive rate
F-measure
15
7/31/2019 Sadia Masood 2
16/53
Web Spam
The Agenda
Focus of the First Paper
Motivation
The Objective DATASET
Link Based Features
Content Based Features
Using the Content & Link Based Features
Using the Web Topology
Conclusion
16
7/31/2019 Sadia Masood 2
17/53
Link Based Features
Degree related measures
PageRank
TrustRank Truncated PageRank
Estimation of d supporters
140 features per host
17
7/31/2019 Sadia Masood 2
18/53
Link Based Features
Page Rank
Uses link structure to determine importance or popularity of a web page
Intuition:
A web page is important if several other important pages point to
it
PageRank of a page influences and is being influenced by
importance of the other pages
18
7/31/2019 Sadia Masood 2
19/53
Link Based Features
Page Rank
Calculation: Initial scores already defined for all the pages
A
B
C
D
[6] http://en.wikipedia.org/wiki/PageRank
A
B
C
D
19
7/31/2019 Sadia Masood 2
20/53
Link Based Features
Page Rank
Page Rank with Random Surfer Model
d is damping factor,
M(pi) is the set of pages that link topi,
L(pj) is the number of outbound links on pagepj,
Nis the total number of pages.
20
Li k B d F t
7/31/2019 Sadia Masood 2
21/53
Link Based Features
Trust Rank
A page having high pageRank is more likely to be spam if it had no
relationship with a set of trusted pages
How it works?
Determine the seed set S (using high PageRank or high out-degree)
Determine good or bad nodes of the set S (using oracle)
E.g, S ={ , , }
where, good =
bad =21
Li k B d F t
7/31/2019 Sadia Masood 2
22/53
Link Based Features
Trust Rank
Image [4]: http://www.tejedoresdelweb.com/slides/2007_ojobuscador_madrid_spam.pdf
Calculate & propagate the trust from good pages (adjusting trust
attenuation)
22
Link Based Feat res
7/31/2019 Sadia Masood 2
23/53
Link Based Features
Histogram of ratio b/w TrustRank & PageRank
23
Link Based Features
7/31/2019 Sadia Masood 2
24/53
Link Based Features
Truncated Page Rank
A variant of PageRank, diminishes the influence of a page
to the PageRank score of its close neighbors
spamweb web
24
Link Based Features
7/31/2019 Sadia Masood 2
25/53
Link Based Features
Estimation of d-supporters
x is d-supporter of node y if shortest path from x to y has
length d Nd(x) be the set of d-supporters of page x
For each page x, cardinality of Nd(x) is an increasing fuction
with respect to d.
Image [4]25
Link Based Features
7/31/2019 Sadia Masood 2
26/53
Link Based Features
Bottleneck Number
The bottleneck measure for page x, defined as
indicates the minimum rate of growth of neighbors of x up to a
certain distance
spam pages have smaller bottle neck numbers than non-spam
pages
Bottleneck: Non-Spam & Spam
Image [4] 26
Link Based Features
7/31/2019 Sadia Masood 2
27/53
Link Based Features
Histogram of b4(x) of Spam and Non-spam Pages
27
The Agenda
7/31/2019 Sadia Masood 2
28/53
Web Spam
The Agenda
Focus of the First Paper
Motivation
The Objective
DATASET
Link Based Features
Content Based Features Using the Content & Link Based Features
Using the Web Topology
Conclusion
28
Content Based Features
7/31/2019 Sadia Masood 2
29/53
Content Based Features
Number of words in the page, title & average word length
Fraction of anchor text
Fraction of visible text
Compression rate
Corpus precision & corpus recall
Query precision and query recall Independent trigram likelihood
Entropy of trigrams
96 features per host
29
Content Based Features
7/31/2019 Sadia Masood 2
30/53
Co te t ased eatu es
Average word length
30
Content Based Features
7/31/2019 Sadia Masood 2
31/53
Compression Rate
Compression rate = size of compressed text (visible text)
size of uncompressed text
Precision and Recall
F= Frequent terms in the collection
T= Terms in the page
Q= Frequent terms in the query log
Corpus Recall = | F T | / | F |
Query Recall = | Q T | / | Q |
31
Content Based Features
7/31/2019 Sadia Masood 2
32/53
Corpus Precision = | F T | / | T |
Query Precision = | Q T | / | T |
32
Content Based Features
7/31/2019 Sadia Masood 2
33/53
Entropy of trigrams (Compression)
Calculated on distribution of trigrams
Let
be probability distribution on trigrams of a page
be the set of all trigrams in a page
Entropy of trigrams=
33
The Agenda
7/31/2019 Sadia Masood 2
34/53
Web Spam
Focus of the First Paper
Motivation
The Objective DATASET
Link Based Features
Content Based Features
Using the Link and Content Features
Using the Web Topology
Conclusion
34
Using Link and Content Based Features
7/31/2019 Sadia Masood 2
35/53
Cost Sensitive Decision Tree
Bagging
35
Using Content & Link Based Features
7/31/2019 Sadia Masood 2
36/53
Comparing Link and Content based features
36
The Agenda
7/31/2019 Sadia Masood 2
37/53
Web Spam
Focus of the First Paper
Motivation
The Objective
DATASET
Link Based Features
Content Based Features
Using the Content & Link Based Features
Using the Web Topology
Conclusion
37
Using the Web Topology
7/31/2019 Sadia Masood 2
38/53
Observation :
Similar pages tend to be linked to be linked together more frequently
than dissimilar ones
Similar pages tend to be linked to be linked together more frequentlythan dissimilar ones
38
Using the Web Topology
7/31/2019 Sadia Masood 2
39/53
39
Using the Web Topology -Topological Dependencies of Spam Nodes
7/31/2019 Sadia Masood 2
40/53
Sin
Sout= No. of spam hosts linked by x
All hosts linked by x
Sout
Sin= No. of spam hosts linked to xAll hosts linked to x
40
Using the Web Topology
7/31/2019 Sadia Masood 2
41/53
Clustering
Using METIS graph clustering algorithm
41
Using the Web Topology
7/31/2019 Sadia Masood 2
42/53
Clustering
42
Using the Web Topology
7/31/2019 Sadia Masood 2
43/53
Use the graph topology to smooth predictions by propagating them
as random walks
Main Idea
Use thepredicted spamicityof a particular classification method and
start a random walk with the restart probabilty 1-
Where
h: host
p(h) : [0..1] ( p(h)=0: non-spam , p(h)=1: spam, for each host h)
v(0) : vector such that v(0)h =
outdeg(g): the out-degree of g
Propagation
43
Using the Web Topology- Propagation
7/31/2019 Sadia Masood 2
44/53
Applied with three forms of random walk
on the host graph (forward)
on the transposed host graph (backward)
on the undirected host graph (both)
Observed improvements on out of other tried values
Propagation
44
Using the Web Topology
7/31/2019 Sadia Masood 2
45/53
Results with in the table
Propagation
45
Using the Web Topology
7/31/2019 Sadia Masood 2
46/53
Stacked Graph Learning
Uses initial predictions by the classification scheme C for
all the objects in the data set
Generates a set of extra features for each object based on
relevance ( w.r.t in-links, outlinks or both)
Let r(h) be the set if pages related to host h
Then,
Adds these extra features f(h) to the input of C and run the algorithm
again
46
Using the Web Topology
7/31/2019 Sadia Masood 2
47/53
Stacked Graph Learning
Improvement is seen in the first Iteration
47
Using the Web Topology
7/31/2019 Sadia Masood 2
48/53
Stacked Graph Learning
Second Iteration showed even better results
48
The Agenda
7/31/2019 Sadia Masood 2
49/53
Web Spam
Focus of the First Paper
Motivation
The Objective
DATASET
Link Based Features
Content Based Features Using the Web Topology
Conclusion
49
Conclusion
7/31/2019 Sadia Masood 2
50/53
Experiments have clearly shown that
Combining both the link and content based features bringssignificant improvement to spam detection techniques
Using web graph dependencies , we can detect the web spam by
turning spammers ingenuity against themselves
50
7/31/2019 Sadia Masood 2
51/53
Thank You!
Q&A and Discussion
51
References-1
7/31/2019 Sadia Masood 2
52/53
[1] www.milliondollarhomepage.com
[2] http://www.dcc.uchile.cl/~ccastill/papers/cdgms_2006_know_your_neighbors.pdf
[3] http://images.google.com/imgres?
[4] http://www.tejedoresdelweb.com/slides/2007_ojobuscador_madrid_spam.pdf
[5] Z. Gyongyi and H. Garcia-Molina. Web spam taxonomy. In First International
Workshop on Adversarial Information Retrieval on the Web, 2005.
[6] http://en.wikipedia.org/wiki/PageRank
[7] L. Becchetti, C. Castillo, D. Donato, S. Leonardi, and R. Baeza-Yates. Link-based
characterization and detection of Web Spam. In AIRWeb, 2006.
[8] http://infolab.stanford.edu/~zoltan/publications/gyongyi2006link.pdf52
References-2
http://www.milliondollarhomepage.com/http://www.dcc.uchile.cl/~ccastill/papers/cdgms_2006_know_your_neighbors.pdfhttp://images.google.com/imgreshttp://www.tejedoresdelweb.com/slides/2007_ojobuscador_madrid_spam.pdfhttp://infolab.stanford.edu/~zoltan/publications/gyongyi2006link.pdfhttp://infolab.stanford.edu/~zoltan/publications/gyongyi2006link.pdfhttp://www.tejedoresdelweb.com/slides/2007_ojobuscador_madrid_spam.pdfhttp://images.google.com/imgreshttp://www.dcc.uchile.cl/~ccastill/papers/cdgms_2006_know_your_neighbors.pdfhttp://www.milliondollarhomepage.com/7/31/2019 Sadia Masood 2
53/53
[9] L. Becchetti, C. Castillo, D. Donato, S. Leonardi, and R. Baeza-Yates. Using
rank propagation and probabilistic counting for link-based spam detection.
Technical report, DELIS Dynamically Evolving, Large-Scale Information
Systems, 2006.
[10] http://en.wikipedia.org/wiki/Cross-validation
[11]A. Ntoulas, M. Najork, M. Manasse, and D. Fetterly. Detecting spam web
pages through content analysis. In Proceedings of the World Wide Web
conference, pages 8392, Edinburgh, Scotland, May 2006.[12] http://www.salford-systems.com/doc/BAGGING_PREDICTORS.pdf
[13] Z. Gyongyi, H. Garcia-Molina, and J. Pedersen. Combating web spam with
TrustRank. In Proc. of the 30th VLDB Conf., 2004.
[14] M. Henzinger, R. Motwani, and C. Silverstein. Challenges in web search
engines. ACM SIGIR Forum, 36(2), 2002.
[15] http://www.yr-bcn.es/webspam/datasets/
53
http://en.wikipedia.org/wiki/Cross-validationhttp://www.salford-systems.com/doc/BAGGING_PREDICTORS.pdfhttp://www.salford-systems.com/doc/BAGGING_PREDICTORS.pdfhttp://en.wikipedia.org/wiki/Cross-validation