supporters Counting 2. Yahoo! Research – Barcelona, Spain...
Transcript of supporters Counting 2. Yahoo! Research – Barcelona, Spain...
-
Link-BasedSpam Detection
L. Becchetti,C. Castillo,D. Donato,
S. Leonardi andR. Baeza-Yates
Motivation
Degree-basedmeasures
PageRank
TrustRank
TruncatedPageRank
Countingsupporters
Conclusions
Link-based Spam Detection
Luca Becchetti1, Carlos Castillo1,Debora Donato1,Stefano Leonardi1 and Ricardo Baeza-Yates2
1. Università di Roma “La Sapienza” – Rome, Italy2. Yahoo! Research – Barcelona, Spain and Santiago, Chile
Aug 11th, AirWeb 2006
Link-BasedSpam Detection
L. Becchetti,C. Castillo,D. Donato,
S. Leonardi andR. Baeza-Yates
Motivation
Degree-basedmeasures
PageRank
TrustRank
TruncatedPageRank
Countingsupporters
Conclusions
1 Motivation
2 Degree-based measures
3 PageRank
4 TrustRank
5 Truncated PageRank
6 Counting supporters
7 Conclusions
http://www.dis.uniroma1.it/~becchet/http://www.chato.cl/http://www.dis.uniroma1.it/~donato/http://www.dis.uniroma1.it/~leon/http://www.baeza.cl/http://www.uniroma1.it/http://research.yahoo.com/
-
Link-BasedSpam Detection
L. Becchetti,C. Castillo,D. Donato,
S. Leonardi andR. Baeza-Yates
Motivation
Degree-basedmeasures
PageRank
TrustRank
TruncatedPageRank
Countingsupporters
Conclusions
1 Motivation2 Degree-based measures3 PageRank4 TrustRank5 Truncated PageRank6 Counting supporters7 Conclusions
Link-BasedSpam Detection
L. Becchetti,C. Castillo,D. Donato,
S. Leonardi andR. Baeza-Yates
Motivation
Degree-basedmeasures
PageRank
TrustRank
TruncatedPageRank
Countingsupporters
Conclusions
What is on the Web?
Information + Porn + On-line casinos + Free movies +Cheap software + Buy a MBA diploma + Prescription -freedrugs + V!-4-gra + Get rich now now now!!!
Graphic: www.milliondollarhomepage.com
-
Link-BasedSpam Detection
L. Becchetti,C. Castillo,D. Donato,
S. Leonardi andR. Baeza-Yates
Motivation
Degree-basedmeasures
PageRank
TrustRank
TruncatedPageRank
Countingsupporters
Conclusions
Web spam (keywords + links)
Link-BasedSpam Detection
L. Becchetti,C. Castillo,D. Donato,
S. Leonardi andR. Baeza-Yates
Motivation
Degree-basedmeasures
PageRank
TrustRank
TruncatedPageRank
Countingsupporters
Conclusions
Web spam (mostly keywords)
-
Link-BasedSpam Detection
L. Becchetti,C. Castillo,D. Donato,
S. Leonardi andR. Baeza-Yates
Motivation
Degree-basedmeasures
PageRank
TrustRank
TruncatedPageRank
Countingsupporters
Conclusions
Search engine?
Link-BasedSpam Detection
L. Becchetti,C. Castillo,D. Donato,
S. Leonardi andR. Baeza-Yates
Motivation
Degree-basedmeasures
PageRank
TrustRank
TruncatedPageRank
Countingsupporters
Conclusions
Fake search engine
-
Link-BasedSpam Detection
L. Becchetti,C. Castillo,D. Donato,
S. Leonardi andR. Baeza-Yates
Motivation
Degree-basedmeasures
PageRank
TrustRank
TruncatedPageRank
Countingsupporters
Conclusions
Problem: “normal” pages that are spam
Link-BasedSpam Detection
L. Becchetti,C. Castillo,D. Donato,
S. Leonardi andR. Baeza-Yates
Motivation
Degree-basedmeasures
PageRank
TrustRank
TruncatedPageRank
Countingsupporters
Conclusions
Link farms
Single-level farms can be detected by searching groups of nodessharing their out-links [Gibson et al., 2005]
-
Link-BasedSpam Detection
L. Becchetti,C. Castillo,D. Donato,
S. Leonardi andR. Baeza-Yates
Motivation
Degree-basedmeasures
PageRank
TrustRank
TruncatedPageRank
Countingsupporters
Conclusions
Motivation
[Fetterly et al., 2004] hypothesized that studying thedistribution of statistics about pages could be a good way ofdetecting spam pages:
“in a number of these distributions, outlier values areassociated with web spam”
Research goal
Statistical analysis of link-based spam
Link-BasedSpam Detection
L. Becchetti,C. Castillo,D. Donato,
S. Leonardi andR. Baeza-Yates
Motivation
Degree-basedmeasures
PageRank
TrustRank
TruncatedPageRank
Countingsupporters
Conclusions
Metrics
Graph algorithms
Streamed algorithms
Symmetric algorithms
All shortest paths, centrality, betweenness, clustering coefficient...
(Strongly) connected componentsApproximate count of neighborsPageRank, Truncated PageRank, Linear RankHITS, Salsa, TrustRank
Breadth-first and depth-first searchCount of neighbors
-
Link-BasedSpam Detection
L. Becchetti,C. Castillo,D. Donato,
S. Leonardi andR. Baeza-Yates
Motivation
Degree-basedmeasures
PageRank
TrustRank
TruncatedPageRank
Countingsupporters
Conclusions
Test collection
U.K. collection
18.5 million pages downloaded from the .UK domain
5,344 hosts manually classified (6% of the hosts)
Classified entire hosts:
V A few hosts are mixed: spam and non-spam pagesX More coverage: sample covers 32% of the pages
Link-BasedSpam Detection
L. Becchetti,C. Castillo,D. Donato,
S. Leonardi andR. Baeza-Yates
Motivation
Degree-basedmeasures
PageRank
TrustRank
TruncatedPageRank
Countingsupporters
Conclusions
1 Motivation2 Degree-based measures3 PageRank4 TrustRank5 Truncated PageRank6 Counting supporters7 Conclusions
-
Link-BasedSpam Detection
L. Becchetti,C. Castillo,D. Donato,
S. Leonardi andR. Baeza-Yates
Motivation
Degree-basedmeasures
PageRank
TrustRank
TruncatedPageRank
Countingsupporters
Conclusions
Degree
1 100 100000
0.1
0.2
0.3
0.4
Number of in−links
In−degree δ = 0.35
Normal Spam
(δ = max. difference in C.D.F. plot)
Link-BasedSpam Detection
L. Becchetti,C. Castillo,D. Donato,
S. Leonardi andR. Baeza-Yates
Motivation
Degree-basedmeasures
PageRank
TrustRank
TruncatedPageRank
Countingsupporters
Conclusions
Degree
1 10 50 1000
0.1
0.2
0.3
Number of out−links
Out−degree δ = 0.28
Normal Spam
-
Link-BasedSpam Detection
L. Becchetti,C. Castillo,D. Donato,
S. Leonardi andR. Baeza-Yates
Motivation
Degree-basedmeasures
PageRank
TrustRank
TruncatedPageRank
Countingsupporters
Conclusions
Edge reciprocity
0 0.2 0.4 0.6 0.8 10
0.1
0.2
0.3
0.4
0.5
Fraction of reciprocal links
Reciprocity of max. PR page δ = 0.35
Normal Spam
Link-BasedSpam Detection
L. Becchetti,C. Castillo,D. Donato,
S. Leonardi andR. Baeza-Yates
Motivation
Degree-basedmeasures
PageRank
TrustRank
TruncatedPageRank
Countingsupporters
Conclusions
Assortativity
0.001 0.01 0.1 1 10 100 10000
0.1
0.2
0.3
0.4
Degree/Degree ratio of home page
Degree / Degree of neighbors δ = 0.31
Normal Spam
-
Link-BasedSpam Detection
L. Becchetti,C. Castillo,D. Donato,
S. Leonardi andR. Baeza-Yates
Motivation
Degree-basedmeasures
PageRank
TrustRank
TruncatedPageRank
Countingsupporters
Conclusions
Automatic classifier
All of the following attributes in the home page and the pagewith the maximum PageRank, plus a binary variableindicating if they are the same page:
- In-degree, out-degree- Fraction of reciprocal edges- Degree divided by degree of direct neighbors- Average and sum of in-degree of out-neighbors- Average and sum of out-degree of in-neighbors
Decision tree
72.6% of detection rate, with 3.1% false positives
Link-BasedSpam Detection
L. Becchetti,C. Castillo,D. Donato,
S. Leonardi andR. Baeza-Yates
Motivation
Degree-basedmeasures
PageRank
TrustRank
TruncatedPageRank
Countingsupporters
Conclusions
1 Motivation2 Degree-based measures3 PageRank4 TrustRank5 Truncated PageRank6 Counting supporters7 Conclusions
-
Link-BasedSpam Detection
L. Becchetti,C. Castillo,D. Donato,
S. Leonardi andR. Baeza-Yates
Motivation
Degree-basedmeasures
PageRank
TrustRank
TruncatedPageRank
Countingsupporters
Conclusions
PageRank
Let PN×N be the normalized link matrix of a graph
Row-normalized
No “sinks”
Definition (PageRank)
Stationary state of:
αP +(1− α)
N1N×N
Follow links with probability α
Random jump with probability 1− α
Link-BasedSpam Detection
L. Becchetti,C. Castillo,D. Donato,
S. Leonardi andR. Baeza-Yates
Motivation
Degree-basedmeasures
PageRank
TrustRank
TruncatedPageRank
Countingsupporters
Conclusions
Maximum PageRank in the Host
1e−08 1e−07 1e−06 1e−05 0.00010
0.1
0.2
0.3
PageRank
Maximum PageRank of the site δ = 0.23
Normal Spam
-
Link-BasedSpam Detection
L. Becchetti,C. Castillo,D. Donato,
S. Leonardi andR. Baeza-Yates
Motivation
Degree-basedmeasures
PageRank
TrustRank
TruncatedPageRank
Countingsupporters
Conclusions
Variance of PageRank
Suggested in [Benczúr et al., 2005]
PageRank PageRank
Link-BasedSpam Detection
L. Becchetti,C. Castillo,D. Donato,
S. Leonardi andR. Baeza-Yates
Motivation
Degree-basedmeasures
PageRank
TrustRank
TruncatedPageRank
Countingsupporters
Conclusions
Variance of PageRank of in-neighbors
0 0.2 0.4 0.6 0.8 10
0.1
0.2
0.3
σ2 of the logarithm of PageRank
Stdev. of PR of Neighbors (Home) δ = 0.41
Normal Spam
-
Link-BasedSpam Detection
L. Becchetti,C. Castillo,D. Donato,
S. Leonardi andR. Baeza-Yates
Motivation
Degree-basedmeasures
PageRank
TrustRank
TruncatedPageRank
Countingsupporters
Conclusions
Automatic classifier
Features: degree-based plus the following in the home pageand the page with maximum PageRank:
- PageRank- In-degree/PageRank- Out-degree/PageRank- Standard deviation of PageRank of in-neighbors = σ2
- σ2/PageRank
Plus the PageRank of the home page divided by thePageRank of the page with the maximum PageRank.
Decision tree
74.4% of detection rate, with 2.6% false positives(Degree-based: 72.6% of detection, 3.1% false positives)
Link-BasedSpam Detection
L. Becchetti,C. Castillo,D. Donato,
S. Leonardi andR. Baeza-Yates
Motivation
Degree-basedmeasures
PageRank
TrustRank
TruncatedPageRank
Countingsupporters
Conclusions
1 Motivation2 Degree-based measures3 PageRank4 TrustRank5 Truncated PageRank6 Counting supporters7 Conclusions
-
Link-BasedSpam Detection
L. Becchetti,C. Castillo,D. Donato,
S. Leonardi andR. Baeza-Yates
Motivation
Degree-basedmeasures
PageRank
TrustRank
TruncatedPageRank
Countingsupporters
Conclusions
TrustRank
TrustRank [Gyöngyi et al., 2004]
A node with high PageRank, but far away from a core set of“trusted nodes” is suspicious
Start from a set of trusted nodes, then do a random walk,returning to the set of trusted nodes with probability 1− α ateach step
i Trusted nodes: data from http://www.dmoz.org/
Link-BasedSpam Detection
L. Becchetti,C. Castillo,D. Donato,
S. Leonardi andR. Baeza-Yates
Motivation
Degree-basedmeasures
PageRank
TrustRank
TruncatedPageRank
Countingsupporters
Conclusions
TrustRank score
1e−06 0.0010
0.1
0.2
0.3
0.4
TrustRank
TrustRank score of home page δ = 0.59
Normal Spam
http://www.dmoz.org/
-
Link-BasedSpam Detection
L. Becchetti,C. Castillo,D. Donato,
S. Leonardi andR. Baeza-Yates
Motivation
Degree-basedmeasures
PageRank
TrustRank
TruncatedPageRank
Countingsupporters
Conclusions
TrustRank / PageRank
0.3 1 10 1000
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
TrustRank score/PageRank
Estimated relative non−spam mass δ = 0.59
Normal Spam
Link-BasedSpam Detection
L. Becchetti,C. Castillo,D. Donato,
S. Leonardi andR. Baeza-Yates
Motivation
Degree-basedmeasures
PageRank
TrustRank
TruncatedPageRank
Countingsupporters
Conclusions
Automatic classifier
PageRank attributes, plus the following in the home page andthe page with maximum PageRank:
- TrustScore- TrustScore/PageRank (estimated relative non-spam mass)- TrustScore/In-degree
Plus the TrustScore in the home page divided by theTrustScore in the page with the maximum PageRank.
Decision tree
77.3% of detection rate, with 3.0% false positives(PageRank-based: 74.4% of detection, 2.6% false positives)
-
Link-BasedSpam Detection
L. Becchetti,C. Castillo,D. Donato,
S. Leonardi andR. Baeza-Yates
Motivation
Degree-basedmeasures
PageRank
TrustRank
TruncatedPageRank
Countingsupporters
Conclusions
1 Motivation2 Degree-based measures3 PageRank4 TrustRank5 Truncated PageRank6 Counting supporters7 Conclusions
Link-BasedSpam Detection
L. Becchetti,C. Castillo,D. Donato,
S. Leonardi andR. Baeza-Yates
Motivation
Degree-basedmeasures
PageRank
TrustRank
TruncatedPageRank
Countingsupporters
Conclusions
Path-based formula for PageRank
Given a path p = 〈x1, x2, . . . , xt〉 of length t = |p|
branching(p) =1
d1d2 · · · dt−1where di are the out-degrees of the members of the path
Explicit formula for PageRank [Newman et al., 2001]
ri (α) =∑
p∈Path(−,i)
(1− α)α|p|
Nbranching(p)
Path(−, i) are incoming paths in node i
-
Link-BasedSpam Detection
L. Becchetti,C. Castillo,D. Donato,
S. Leonardi andR. Baeza-Yates
Motivation
Degree-basedmeasures
PageRank
TrustRank
TruncatedPageRank
Countingsupporters
Conclusions
General functional ranking
In general:
ri (α) =∑
p∈Path(−,i)
damping(|p|)N
branching(p)
There are many choices for damping(|p|), including a simplelinear function that is as good as PageRank in practice[Baeza-Yates et al., 2006]
Link-BasedSpam Detection
L. Becchetti,C. Castillo,D. Donato,
S. Leonardi andR. Baeza-Yates
Motivation
Degree-basedmeasures
PageRank
TrustRank
TruncatedPageRank
Countingsupporters
Conclusions
Truncated PageRank
Reduce the direct contribution of the first levels of links:
damping(t) =
{0 t ≤ TCαt t > T
V No extra reading of the graph after PageRank
-
Link-BasedSpam Detection
L. Becchetti,C. Castillo,D. Donato,
S. Leonardi andR. Baeza-Yates
Motivation
Degree-basedmeasures
PageRank
TrustRank
TruncatedPageRank
Countingsupporters
Conclusions
Truncated PageRank(T=2) / PageRank
0.2 0.4 0.6 0.8 1 1.2 1.4 1.60
0.1
0.2
0.3
TruncatedPageRank(T=2) / PageRank
TruncatedPageRank T=2 / PageRank δ = 0.30
Normal Spam
Link-BasedSpam Detection
L. Becchetti,C. Castillo,D. Donato,
S. Leonardi andR. Baeza-Yates
Motivation
Degree-basedmeasures
PageRank
TrustRank
TruncatedPageRank
Countingsupporters
Conclusions
Max. change of Truncated PageRank
0.85 0.9 0.95 1 1.05 1.10
0.1
0.2
max(TrPRi+1
/TrPri)
Maximum change of Truncated PageRank δ = 0.29
Normal Spam
-
Link-BasedSpam Detection
L. Becchetti,C. Castillo,D. Donato,
S. Leonardi andR. Baeza-Yates
Motivation
Degree-basedmeasures
PageRank
TrustRank
TruncatedPageRank
Countingsupporters
Conclusions
Automatic classifier
PageRank attributes, plus the following in the home page andthe page with maximum PageRank:
- TruncPageRank(T = 1 . . . 4)- TruncPageRank(T = 4) / TruncPageRank(T = 3)- TruncPageRank(T = 3) / TruncPageRank(T = 2)- TruncPageRank(T = 2) / TruncPageRank(T = 1)- TruncPageRank(T = 1 . . . 4) / PageRank- Minimum, maximum and average of:TruncPageRank(T = i)/TruncPageRank(T = i − 1)Plus the TruncatedPageRank(T = 1 . . . 4) of the home pagedivided by the same value in the page with the maximumPageRank.
Decision tree
76.9% of detection rate, with 2.5% false positives(TrustRank-based: 77.3% of detection, 3.0% false positives)
Link-BasedSpam Detection
L. Becchetti,C. Castillo,D. Donato,
S. Leonardi andR. Baeza-Yates
Motivation
Degree-basedmeasures
PageRank
TrustRank
TruncatedPageRank
Countingsupporters
Conclusions
1 Motivation2 Degree-based measures3 PageRank4 TrustRank5 Truncated PageRank6 Counting supporters7 Conclusions
-
Link-BasedSpam Detection
L. Becchetti,C. Castillo,D. Donato,
S. Leonardi andR. Baeza-Yates
Motivation
Degree-basedmeasures
PageRank
TrustRank
TruncatedPageRank
Countingsupporters
Conclusions
Idea: count “supporters” at different distances
Number of different nodes at a given distance:
.UK 18 mill. nodes .EU.INT 860,000 nodes
0.0
0.1
0.2
0.3
5 10 15 20 25 30
Freq
uenc
y
Distance
0.0
0.1
0.2
0.3
5 10 15 20 25 30
Freq
uenc
y
Distance
Average distance Average distance14.9 clicks 10.0 clicks
Link-BasedSpam Detection
L. Becchetti,C. Castillo,D. Donato,
S. Leonardi andR. Baeza-Yates
Motivation
Degree-basedmeasures
PageRank
TrustRank
TruncatedPageRank
Countingsupporters
Conclusions
High and low-ranked pages are different
1 5 10 15 200
2
4
6
8
10
12
x 104
Distance
Num
ber o
f Nod
es
Top 0%−10%Top 40%−50%Top 60%−70%
Areas below the curves are equal if we are in the samestrongly-connected component
-
Link-BasedSpam Detection
L. Becchetti,C. Castillo,D. Donato,
S. Leonardi andR. Baeza-Yates
Motivation
Degree-basedmeasures
PageRank
TrustRank
TruncatedPageRank
Countingsupporters
Conclusions
Probabilistic counting
100010
100010
110000
000110
000011
100010
100011
111100111111
100011
Count bits setto estimatesupporters
Target page
Propagation ofbits using the
“OR” operation
100010
Improvement of ANF algorithm [Palmer et al., 2002] based onprobabilistic counting [Flajolet and Martin, 1985]
Link-BasedSpam Detection
L. Becchetti,C. Castillo,D. Donato,
S. Leonardi andR. Baeza-Yates
Motivation
Degree-basedmeasures
PageRank
TrustRank
TruncatedPageRank
Countingsupporters
Conclusions
General algorithm
Require: N: number of nodes, d: distance, k: bits1: for node : 1 . . . N, bit: 1 . . . k do2: INIT(node,bit)3: end for4: for distance : 1 . . . d do {Iteration step}5: Aux ← 0k6: for src : 1 . . . N do {Follow links in the graph}7: for all links from src to dest do8: Aux[dest] ← Aux[dest] OR V[src,·]9: end for
10: end for11: V ← Aux12: end for13: for node: 1 . . .N do {Estimate supporters}14: Supporters[node] ← ESTIMATE( V[node,·] )15: end for16: return Supporters
-
Link-BasedSpam Detection
L. Becchetti,C. Castillo,D. Donato,
S. Leonardi andR. Baeza-Yates
Motivation
Degree-basedmeasures
PageRank
TrustRank
TruncatedPageRank
Countingsupporters
Conclusions
Our estimator
Initialize all bits to one with probability �
Estimator: neighbors(node) = log(1−�)
(1− ones(node)k
)Adaptive estimation
Repeat the above process for � = 1/2, 1/4, 1/8, . . . , and lookfor the transitions from more than (1− 1/e)k ones to lessthan (1− 1/e)k ones.
Link-BasedSpam Detection
L. Becchetti,C. Castillo,D. Donato,
S. Leonardi andR. Baeza-Yates
Motivation
Degree-basedmeasures
PageRank
TrustRank
TruncatedPageRank
Countingsupporters
Conclusions
Convergence
5 10 15 200%
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
Iteration
Frac
tion
of n
odes
with
est
imat
es
d=1d=2d=3d=4d=5d=6d=7d=8
-
Link-BasedSpam Detection
L. Becchetti,C. Castillo,D. Donato,
S. Leonardi andR. Baeza-Yates
Motivation
Degree-basedmeasures
PageRank
TrustRank
TruncatedPageRank
Countingsupporters
Conclusions
Error rate
1 2 3 4 5 6 7 80
0.1
0.2
0.3
0.4
0.5
512 b×i 768 b×i 832 b×i 960 b×i 1152 b×i 1216 b×i 1344 b×i 1408 b×i
576 b×i1152 b×i
512 b×i768 b×i
832 b×i
960 b×i
1152 b×i1216 b×i
1344 b×i 1408 b×i
Distance
Ave
rage
Rel
ativ
e E
rror
Ours 64 bits, epsilon−only estimatorOurs 64 bits, combined estimatorANF 24 bits × 24 iterations (576 b×i)ANF 24 bits × 48 iterations (1152 b×i)
Link-BasedSpam Detection
L. Becchetti,C. Castillo,D. Donato,
S. Leonardi andR. Baeza-Yates
Motivation
Degree-basedmeasures
PageRank
TrustRank
TruncatedPageRank
Countingsupporters
Conclusions
Hosts at distance 4
1 100 10000
0.1
0.2
0.3
0.4
S4 − S
3
Hosts at Distance Exactly 4 δ = 0.39
Normal Spam
-
Link-BasedSpam Detection
L. Becchetti,C. Castillo,D. Donato,
S. Leonardi andR. Baeza-Yates
Motivation
Degree-basedmeasures
PageRank
TrustRank
TruncatedPageRank
Countingsupporters
Conclusions
Minimum change of supporters
1 5 100
0.1
0.2
0.3
0.4
min(S2/S
1, S
3/S
2, S
4/S
3)
Minimum change of supporters δ = 0.39
Normal Spam
Link-BasedSpam Detection
L. Becchetti,C. Castillo,D. Donato,
S. Leonardi andR. Baeza-Yates
Motivation
Degree-basedmeasures
PageRank
TrustRank
TruncatedPageRank
Countingsupporters
Conclusions
Automatic classifier
PageRank attributes, plus the following in the home page andthe page with maximum PageRank:
- Supporters at 2 . . . 4- Supporters at 2 . . . 4 / PageRank- Supporters at i / Supporters at i − 1 (for i = 1..4)- Minimum, maximum and average of: Supporters at i /Supporters at i − 1 (for i = 1..4)- (Supporters at i - Supporters at i − 1) / PageRankPlus the number of supporters at distance 2 . . . 4 in the homepage divided by the same feature in the page with themaximum PageRank.
Decision tree
78.9% of detection rate, with 2.5% false positives(TruncPR: 76.9% of detection, 2.5% false positives)
(TrustRank: 77.7% of detection, 3.0% false positives)
-
Link-BasedSpam Detection
L. Becchetti,C. Castillo,D. Donato,
S. Leonardi andR. Baeza-Yates
Motivation
Degree-basedmeasures
PageRank
TrustRank
TruncatedPageRank
Countingsupporters
Conclusions
1 Motivation2 Degree-based measures3 PageRank4 TrustRank5 Truncated PageRank6 Counting supporters7 Conclusions
Link-BasedSpam Detection
L. Becchetti,C. Castillo,D. Donato,
S. Leonardi andR. Baeza-Yates
Motivation
Degree-basedmeasures
PageRank
TrustRank
TruncatedPageRank
Countingsupporters
Conclusions
Summary of classifiers
Detection FalseMetrics rate positives
Degree (D) 72.6% 3.1%D + PageRank (P) 74.4% 2.6%D + P + TrustRank 77.7% 3.0%D + P + Trunc. PageRank 76.9% 2.5%D + P + Est. of Supporters 78.9% 2.5%
All attributes 81.4% 2.8%All attributes (more rules) 80.8% 1.1%
Comparison
Content-based analysis [Ntoulas et al., 2006] has shown86.2% detection rate with 2.2% false positives
Classifier based on TrustRank [Gyöngyi et al., 2004]:49%-50% of detection rate with 2.3%-2.1% error in oursample – SpamRank [Benczúr et al., 2005] reports similardetection rates
-
Link-BasedSpam Detection
L. Becchetti,C. Castillo,D. Donato,
S. Leonardi andR. Baeza-Yates
Motivation
Degree-basedmeasures
PageRank
TrustRank
TruncatedPageRank
Countingsupporters
Conclusions
Top 10 metrics
1. Binary variable indicating if homepage is the page withmaximum PageRank of the site2. Edge reciprocity3. Different hosts at distance 44. Different hosts at distance 35. Minimum change of supporters (different hosts)6. Different hosts at distance 27. TruncatedPagerank (T=1) / PageRank8. TrustRank score divided by PageRank9. Different hosts at distance 110. TruncatedPagerank (T=2) / PageRank
Link-BasedSpam Detection
L. Becchetti,C. Castillo,D. Donato,
S. Leonardi andR. Baeza-Yates
Motivation
Degree-basedmeasures
PageRank
TrustRank
TruncatedPageRank
Countingsupporters
Conclusions
Conclusions
V Link-based statistics to detect 80% of spamX No magic bullet in link analysisX Precision still low compared to e-mail spam filtersV Measure both home page and max. PageRank pageV Host-based counts are important
Further results at WebKDD 2006 Next step: combine linkanalysis and content analysis
-
Link-BasedSpam Detection
L. Becchetti,C. Castillo,D. Donato,
S. Leonardi andR. Baeza-Yates
Motivation
Degree-basedmeasures
PageRank
TrustRank
TruncatedPageRank
Countingsupporters
Conclusions
If you want privacy, spamize your data!
Thank you!
Questions?
Soon: Spam detection test collection: 11,000 classified hosts
Link-BasedSpam Detection
L. Becchetti,C. Castillo,D. Donato,
S. Leonardi andR. Baeza-Yates
Motivation
Degree-basedmeasures
PageRank
TrustRank
TruncatedPageRank
Countingsupporters
Conclusions
Baeza-Yates, R., Boldi, P., and Castillo, C. (2006).
Generalizing PageRank: Damping functions for link-basedranking algorithms.
In Proceedings of SIGIR, Seattle, Washington, USA. ACMPress.
Becchetti, L., Castillo, C., Donato, D., Leonardi, S., andBaeza-Yates, R. (2006).
Using rank propagation and probabilistic counting forlink-based spam detection.
In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD), Pennsylvania, USA. ACM Press.
Benczúr, A. A., Csalogány, K., Sarlós, T., and Uher, M.(2005).
Spamrank: fully automatic link spam detection.
In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web, Chiba, Japan.
http://www.dcc.uchile.cl/~ccastill/papers/baeza06_general_pagerank_damping_functions_link_ranking.pdfhttp://www.dcc.uchile.cl/~ccastill/papers/baeza06_general_pagerank_damping_functions_link_ranking.pdfhttp://www.dcc.uchile.cl/~ccastill/papers/becchetti_06_automatic_link_spam_detection_rank_propagation.pdfhttp://www.dcc.uchile.cl/~ccastill/papers/becchetti_06_automatic_link_spam_detection_rank_propagation.pdf
-
Link-BasedSpam Detection
L. Becchetti,C. Castillo,D. Donato,
S. Leonardi andR. Baeza-Yates
Motivation
Degree-basedmeasures
PageRank
TrustRank
TruncatedPageRank
Countingsupporters
Conclusions
Fetterly, D., Manasse, M., and Najork, M. (2004).
Spam, damn spam, and statistics: Using statistical analysis tolocate spam web pages.
In Proceedings of the seventh workshop on the Web anddatabases (WebDB), pages 1–6, Paris, France.
Flajolet, P. and Martin, N. G. (1985).
Probabilistic counting algorithms for data base applications.
Journal of Computer and System Sciences, 31(2):182–209.
Gibson, D., Kumar, R., and Tomkins, A. (2005).
Discovering large dense subgraphs in massive graphs.
In VLDB ’05: Proceedings of the 31st international conferenceon Very large data bases, pages 721–732. VLDB Endowment.
Link-BasedSpam Detection
L. Becchetti,C. Castillo,D. Donato,
S. Leonardi andR. Baeza-Yates
Motivation
Degree-basedmeasures
PageRank
TrustRank
TruncatedPageRank
Countingsupporters
Conclusions
Gyöngyi, Z., Molina, H. G., and Pedersen, J. (2004).
Combating web spam with trustrank.
In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB), pages 576–587, Toronto,Canada. Morgan Kaufmann.
Newman, M. E., Strogatz, S. H., and Watts, D. J. (2001).
Random graphs with arbitrary degree distributions and theirapplications.
Phys Rev E Stat Nonlin Soft Matter Phys, 64(2 Pt 2).
Ntoulas, A., Najork, M., Manasse, M., and Fetterly, D. (2006).
Detecting spam web pages through content analysis.
In Proceedings of the World Wide Web conference, pages83–92, Edinburgh, Scotland.
http://dx.doi.org/http://doi.acm.org/10.1145/1017074.1017077http://dx.doi.org/http://doi.acm.org/10.1145/1017074.1017077http://citeseer.ist.psu.edu/flajolet85probabilistic.htmlhttp://portal.acm.org/citation.cfm?id=1083676http://dbpubs.stanford.edu:8090/pub/2004-17http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=pubmed% &dopt=Abstract&list_uids=11497662http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=pubmed% &dopt=Abstract&list_uids=11497662http://dx.doi.org/10.1145/1135777.1135794
-
Link-BasedSpam Detection
L. Becchetti,C. Castillo,D. Donato,
S. Leonardi andR. Baeza-Yates
Motivation
Degree-basedmeasures
PageRank
TrustRank
TruncatedPageRank
Countingsupporters
Conclusions
Palmer, C. R., Gibbons, P. B., and Faloutsos, C. (2002).
ANF: a fast and scalable tool for data mining in massivegraphs.
In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining, pages81–90, New York, NY, USA. ACM Press.
http://dx.doi.org/10.1145/775047.775059http://dx.doi.org/10.1145/775047.775059
MotivationDegree-based measuresPageRankTrustRankTruncated PageRankCounting supportersConclusions