Taking over Search Engines
![Page 1: Taking over Search Engines](https://reader035.fdocuments.net/reader035/viewer/2022062501/56816793550346895ddccce3/html5/thumbnails/1.jpg)
Taking over Search Engines
![Page 2: Taking over Search Engines](https://reader035.fdocuments.net/reader035/viewer/2022062501/56816793550346895ddccce3/html5/thumbnails/2.jpg)
Web Spamming
What is spamming?
– Spamming is the art of increasing the rank of a page.
Why?
– Having more visits means gaining more money.
How?
– Web search engines are the gateways to the Web.
– Get listed in the top results.
![Page 3: Taking over Search Engines](https://reader035.fdocuments.net/reader035/viewer/2022062501/56816793550346895ddccce3/html5/thumbnails/3.jpg)
How much spam is out there?
Real-Web data from the MSN crawler collected during August 2004
105,484,446 Web pages
![Page 4: Taking over Search Engines](https://reader035.fdocuments.net/reader035/viewer/2022062501/56816793550346895ddccce3/html5/thumbnails/4.jpg)
Why is spam bad?
For users:
– Useless pages.
For search engines:
– Wastes bandwidth, CPU cycles, storage space.
– Pollutes the corpus.
– Distorts the ranking of results.
(Again, bad news for users!)
![Page 5: Taking over Search Engines](https://reader035.fdocuments.net/reader035/viewer/2022062501/56816793550346895ddccce3/html5/thumbnails/5.jpg)
Techniques
Web search engines use a number of measures to estimate the importance of a page
– Content analysis: TF x IDF, …
– Link analysis: PageRank, …
Spammers also use a number of techniques!
– Content manipulation, i.e. terms
– Structure manipulation, i.e. links
![Page 6: Taking over Search Engines](https://reader035.fdocuments.net/reader035/viewer/2022062501/56816793550346895ddccce3/html5/thumbnails/6.jpg)
Content Manipulation 1
Repetition Repetition Repetition Repetition Repetition Repetition Repetition Repetition:
– Increases the term frequency.
dumortierite dumose dumous dump dumper dumpage Dumping dumper dumpily:
– Makes a document relevant to many queries.
– Effective when using rare words (inverse document frequency).
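The effect of both tricks on a TF x IDF score can be illustrated with a small sketch. The scoring function, the toy corpus and all names below are assumptions made for illustration; real search engines use more elaborate (and normalized) variants.

```python
import math
from collections import Counter

def tf_idf(term, doc, corpus):
    """Raw term frequency in `doc` times log inverse document frequency."""
    tf = Counter(doc)[term]
    df = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf

# Toy corpus (hypothetical documents, already tokenized).
honest = "cheap flights to rome and cheap hotels".split()
stuffed = ("cheap " * 20 + "flights to rome").split()    # repetition
dumped = "dumortierite dumose dumous dump dumper".split()  # rare words
other = "weather forecast for rome".split()
corpus = [honest, stuffed, dumped, other]

# Repeating a term multiplies its term frequency, and so its score.
repetition_gain = tf_idf("cheap", stuffed, corpus) / tf_idf("cheap", honest, corpus)

# A rare word occurs in few documents, so a single occurrence scores high.
rare_word_score = tf_idf("dumose", dumped, corpus)
common_word_score = tf_idf("rome", honest, corpus)

print(repetition_gain, rare_word_score > common_word_score)
```

With an unnormalized term frequency the gain from repetition is linear in the number of copies, which is one reason practical scoring functions dampen or normalize it.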
![Page 7: Taking over Search Engines](https://reader035.fdocuments.net/reader035/viewer/2022062501/56816793550346895ddccce3/html5/thumbnails/7.jpg)
Where?
Body, title, meta tags, anchor text, URL.
![Page 8: Taking over Search Engines](https://reader035.fdocuments.net/reader035/viewer/2022062501/56816793550346895ddccce3/html5/thumbnails/8.jpg)
Content Manipulation 2
Content repurposing:
– Weaving: insertion of spam words into a well-formed page copied from another web site.
– Phrase stitching: gluing together well-formed sentences copied from many other web sites.
Why?
– Overcomes simple statistics that may be taken into account by web search engines.
![Page 9: Taking over Search Engines](https://reader035.fdocuments.net/reader035/viewer/2022062501/56816793550346895ddccce3/html5/thumbnails/9.jpg)
The Big Picture (1)
Techniques / Boosting / Term
Link Bombing
<a href="target.html">free, great deals, cheap, inexpensive, cheap, free</a>
![Page 10: Taking over Search Engines](https://reader035.fdocuments.net/reader035/viewer/2022062501/56816793550346895ddccce3/html5/thumbnails/10.jpg)
Link Manipulation
Links and pages from the attacker's point of view
![Page 11: Taking over Search Engines](https://reader035.fdocuments.net/reader035/viewer/2022062501/56816793550346895ddccce3/html5/thumbnails/11.jpg)
Creating (Hijacked) In-Links
Honey pots
– Copies of valuable content (e.g. Unix man pages) with hidden links to spam farms or target pages.
Web directories, blogs, wikis
– All of the above usually have high PageRank, and it is possible to add outgoing links to owned pages.
Link exchange
Buying expired domains
Creating link farms
![Page 12: Taking over Search Engines](https://reader035.fdocuments.net/reader035/viewer/2022062501/56816793550346895ddccce3/html5/thumbnails/12.jpg)
Spamming HITS
HITS algorithm:
– Searches for hubs and authorities.
– Top-ranked pages are the most authoritative ones.
Spam on HITS
– Find a collection of good hubs.
– Add links from the hubs to the target page.
– The target page is now linked from good hubs, so it looks authoritative!
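A minimal sketch of this attack, using the textbook HITS iteration (authority = sum of the hub scores of in-neighbors, hub = sum of the authority scores of out-neighbors, each normalized). The page names and the toy graph are hypothetical, and the sketch simply assumes the spammer manages to add links from the hubs.

```python
import math

def hits(links, iterations=50):
    """links maps a page to the pages it points to. Returns (hub, authority)."""
    pages = set(links) | {q for outs in links.values() for q in outs}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority update: collect hub scores over incoming links.
        auth = {p: 0.0 for p in pages}
        for p, outs in links.items():
            for q in outs:
                auth[q] += hub[p]
        norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {p: v / norm for p, v in auth.items()}
        # Hub update: collect authority scores over outgoing links.
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        hub = {p: v / norm for p, v in hub.items()}
    return hub, auth

before = {"h1": ["a1", "a2"], "h2": ["a1", "a2"], "target": []}
after = {"h1": ["a1", "a2", "target"], "h2": ["a1", "a2", "target"], "target": []}

_, auth_before = hits(before)
_, auth_after = hits(after)
print(auth_before["target"], auth_after["target"])
```

In this toy graph the target goes from zero authority to the same authority as the genuine pages a1 and a2, because HITS cannot tell which out-links of a hub are legitimate.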
![Page 13: Taking over Search Engines](https://reader035.fdocuments.net/reader035/viewer/2022062501/56816793550346895ddccce3/html5/thumbnails/13.jpg)
PageRank
PageRank in one equation:
– $PR(p) = \alpha \, (M \cdot PR)_p + (1 - \alpha) \, V_p$
– $M$ is the adjacency matrix of the Web graph.
– $\alpha$ is the damping factor (usually 0.85).
– $V$ is the personalization vector; in case of fairness $V_p = 1/N$ ($N$ = number of pages in the Web).
What happens if a page p has no outgoing links?
– A fraction $\alpha$ of its PR is lost, so all the PR will be lost eventually.
– Solution: normalize the rows of M (i.e. insert links to every other page).
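The equation and the dangling-page fix can be sketched as a plain power iteration. This is a minimal illustration under stated assumptions: a uniform personalization vector, α = 0.85, and a hypothetical three-page graph I → a → O in which O has no out-links.

```python
def pagerank(links, alpha=0.85, fix_dangling=True, iterations=100):
    """Power iteration for PR = alpha * M * PR + (1 - alpha) * V, with V_p = 1/N.

    links maps each page to the list of pages it links to.
    """
    pages = sorted(set(links) | {q for outs in links.values() for q in outs})
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        nxt = {p: (1 - alpha) / n for p in pages}
        for p in pages:
            outs = links.get(p, [])
            if outs:
                # Split the damped rank evenly over the out-links.
                for q in outs:
                    nxt[q] += alpha * pr[p] / len(outs)
            elif fix_dangling:
                # Dangling page: behave as if it linked to every page.
                for q in pages:
                    nxt[q] += alpha * pr[p] / n
        pr = nxt
    return pr

web = {"I": ["a"], "a": ["O"]}  # O is a dangling page
leaky = pagerank(web, fix_dangling=False)
fixed = pagerank(web, fix_dangling=True)
print(round(sum(leaky.values()), 3), round(sum(fixed.values()), 3))
```

Without the fix, the rank that reaches O leaks out of the system at every step and the total settles well below 1 in this toy graph; with the fix, the total stays at 1.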
![Page 14: Taking over Search Engines](https://reader035.fdocuments.net/reader035/viewer/2022062501/56816793550346895ddccce3/html5/thumbnails/14.jpg)
Aggregate Page Rank
Total PageRank is affected by
– Number of pages
– Incoming links
– Outgoing links
– Dangling nodes
Topologies that:
– use as many pages as possible
– minimize outgoing links
– minimize dangling nodes
[Figure: a web site with its incoming links and outgoing links]
![Page 15: Taking over Search Engines](https://reader035.fdocuments.net/reader035/viewer/2022062501/56816793550346895ddccce3/html5/thumbnails/15.jpg)
Chain topology (more is better)
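[Figure: chains of pages between an incoming page I and an outgoing page O, with the PageRank of each node]
One page (I → a → O): I = 0.18, a = 0.34, O = 0.47
PR (Web Site) = 0.34
Two pages (I → a → b → O): I = 0.11, a = 0.21, b = 0.29, O = 0.37
PR (Web Site) = 0.21 + 0.29 = 0.50
Six pages (I → a → b → c → d → e → f → O): I = 0.03, a = 0.07, b = 0.09, c = 0.12, d = 0.14, e = 0.16, f = 0.17, O = 0.18
PR (Web Site) = 0.77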
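The chain figures can be approximated with a few lines of code. The slide does not state its exact setup, so the following are assumptions: α = 0.85, a uniform personalization vector, the whole Web consisting only of I, the site pages and O, and the dangling page O treated as linking to every page. Under those assumptions the aggregate values come out close to the ones on the slide.

```python
def pagerank(links, alpha=0.85, iterations=200):
    """PageRank by power iteration; dangling pages link to every page."""
    pages = sorted(set(links) | {q for outs in links.values() for q in outs})
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        nxt = {p: (1 - alpha) / n for p in pages}
        for p in pages:
            targets = links.get(p) or pages  # dangling -> all pages
            for q in targets:
                nxt[q] += alpha * pr[p] / len(targets)
        pr = nxt
    return pr

def chain(site_pages):
    """Build I -> first site page -> ... -> last site page -> O."""
    nodes = ["I"] + site_pages + ["O"]
    return {u: [v] for u, v in zip(nodes, nodes[1:])}

def site_rank(site_pages):
    """Aggregate PageRank of the site pages in a chain."""
    pr = pagerank(chain(site_pages))
    return sum(pr[p] for p in site_pages)

one = site_rank(["a"])
two = site_rank(["a", "b"])
six = site_rank(["a", "b", "c", "d", "e", "f"])
print(round(one, 2), round(two, 2), round(six, 2))
```

The aggregate grows with the number of site pages because every added page brings its own share of the uniform personalization vector and delays the point where rank leaves the site through O.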
![Page 16: Taking over Search Engines](https://reader035.fdocuments.net/reader035/viewer/2022062501/56816793550346895ddccce3/html5/thumbnails/16.jpg)
Ring topology
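[Figure: ring topology over the site pages a to f, with I and O attached]
Single-page baseline: I = 0.18, a = 0.34, O = 0.47
PageRank values shown for the nodes I, a, O, b, c, d, e, f: 0.03, 0.18, 0.11, 0.11, 0.12, 0.13, 0.14, 0.15
PR (Web Site) = 0.86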
![Page 17: Taking over Search Engines](https://reader035.fdocuments.net/reader035/viewer/2022062501/56816793550346895ddccce3/html5/thumbnails/17.jpg)
Clique topology
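[Figure: clique topology over the site pages a to f, with I and O attached]
Single-page baseline: I = 0.18, a = 0.34, O = 0.47
PageRank values shown for the nodes I, a, O, b, c, d, e, f: 0.03, 0.18, 0.04, 0.15, 0.15, 0.15, 0.15, 0.15
PR (Web Site) = 0.93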
![Page 18: Taking over Search Engines](https://reader035.fdocuments.net/reader035/viewer/2022062501/56816793550346895ddccce3/html5/thumbnails/18.jpg)
Increasing Page Rank of a single target page
Complicated structures do not help
– Chain, ring and clique waste PageRank among every node in the web site.
To maximize the PageRank of a target page a:
– all hijacked pages I must point to a
– all boosting pages (b, c, d, e, f) must point to a
– no links among boosting pages
– the target page must point to all of the boosting pages
![Page 19: Taking over Search Engines](https://reader035.fdocuments.net/reader035/viewer/2022062501/56816793550346895ddccce3/html5/thumbnails/19.jpg)
Star topology
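[Figure: star topology with the target page a at the center and the boosting pages b to f around it]
Single-page baseline: I = 0.18, a = 0.34, O = 0.47
Star: I = 0.03; b, c, d, e, f and O = 0.09 each
PR (a) = 0.43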
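A sketch of the star topology, compared with a six-page chain. The setup is an assumption, since the slide does not spell it out: α = 0.85, a uniform personalization vector, a dangling O treated as linking to every page, and the target a linking to the five boosting pages and to O. Under those assumptions the computed values land close to the ones in the figure.

```python
def pagerank(links, alpha=0.85, iterations=200):
    """PageRank by power iteration; dangling pages link to every page."""
    pages = sorted(set(links) | {q for outs in links.values() for q in outs})
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        nxt = {p: (1 - alpha) / n for p in pages}
        for p in pages:
            targets = links.get(p) or pages  # dangling -> all pages
            for q in targets:
                nxt[q] += alpha * pr[p] / len(targets)
        pr = nxt
    return pr

boosters = ["b", "c", "d", "e", "f"]

# Star: I and every booster point to a; a points back to the boosters and to O.
star = {"I": ["a"], "a": boosters + ["O"]}
star.update({b: ["a"] for b in boosters})

# Chain with the same pages, for comparison: I -> a -> b -> ... -> f -> O.
nodes = ["I", "a"] + boosters + ["O"]
chain = {u: [v] for u, v in zip(nodes, nodes[1:])}

pr_star = pagerank(star)
pr_chain = pagerank(chain)
print(round(pr_star["a"], 2), round(pr_chain["a"], 2))
```

The star concentrates rank on a because every booster passes all of its rank to the target and the target hands most of it straight back to the boosters, which return it on the next step.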
![Page 20: Taking over Search Engines](https://reader035.fdocuments.net/reader035/viewer/2022062501/56816793550346895ddccce3/html5/thumbnails/20.jpg)
Putting all together
Given many spam farms
– Create highly connected topologies among target pages.
Link exchange
– Every target page will be rewarded proportionally to its previous PageRank.
![Page 21: Taking over Search Engines](https://reader035.fdocuments.net/reader035/viewer/2022062501/56816793550346895ddccce3/html5/thumbnails/21.jpg)
Is it worth it?
PageRank has a power-law distribution
– If a page has a low initial PageRank, it is easy to improve it and to get a higher ranking.
– If a page has a higher initial PageRank, it is hard to improve it and even harder to overtake other pages.
Consider that:
– it is cheap to generate a link farm automatically, but
– spamming is expensive in terms of registered domains and IPs.
![Page 22: Taking over Search Engines](https://reader035.fdocuments.net/reader035/viewer/2022062501/56816793550346895ddccce3/html5/thumbnails/22.jpg)
Hiding Techniques
Discriminate between real users and crawlers in order to hide spam activity from both of them
![Page 23: Taking over Search Engines](https://reader035.fdocuments.net/reader035/viewer/2022062501/56816793550346895ddccce3/html5/thumbnails/23.jpg)
Content Hiding
Use the background color for text.
– Adds keywords.
Use small 1-pixel anchor images.
– Adds links.
![Page 24: Taking over Search Engines](https://reader035.fdocuments.net/reader035/viewer/2022062501/56816793550346895ddccce3/html5/thumbnails/24.jpg)
Cloaking
Identify whether the request comes from a real user or a search engine and provide different content.
To users:
– provide target pages.
To search engines:
– provide useful and interesting text.
– provide a link structure that increases PageRank.
Solution:
– Download the same page twice (once identifying as a crawler, once as a regular browser) and compare the two copies.
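The "download twice" check can be sketched as follows. This is only an illustration: the user-agent strings, the word-level similarity measure and the 0.5 threshold are all assumptions, and a cloaker that keys on the crawler's IP address rather than on its user agent would not be caught this way.

```python
import difflib
import urllib.request

CRAWLER_UA = "ExampleBot/1.0"                   # hypothetical crawler identity
BROWSER_UA = "Mozilla/5.0 (X11; Linux x86_64)"  # generic browser identity

def fetch(url, user_agent):
    """Download a page while presenting the given User-Agent header."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")

def similarity(a, b):
    """Word-level similarity of two page bodies, between 0.0 and 1.0."""
    return difflib.SequenceMatcher(None, a.split(), b.split()).ratio()

def looks_cloaked(as_crawler, as_browser, threshold=0.5):
    """Flag a page whose two copies have little content in common."""
    return similarity(as_crawler, as_browser) < threshold

def check(url):
    """Fetch the same URL twice and compare what each client was served."""
    return looks_cloaked(fetch(url, CRAWLER_UA), fetch(url, BROWSER_UA))

# Offline example with two hypothetical page bodies.
flagged = looks_cloaked(
    "the unix manual page for grep explains every option",
    "buy cheap pills casino now",
)
print(flagged)
```

Legitimate pages also change between downloads (ads, timestamps, personalization), so a practical check would tolerate small differences, which is what the threshold is for.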
![Page 25: Taking over Search Engines](https://reader035.fdocuments.net/reader035/viewer/2022062501/56816793550346895ddccce3/html5/thumbnails/25.jpg)
Redirection
The redirection mechanism is used to create doorways to target pages.
Search engines:
– download the page and crawl its links.
Users:
– are immediately redirected to a target page.
![Page 26: Taking over Search Engines](https://reader035.fdocuments.net/reader035/viewer/2022062501/56816793550346895ddccce3/html5/thumbnails/26.jpg)
Why content hiding is tough to detect
HTML code can be parsed to detect spam intrusions.
JavaScript code can be parsed too, but it is more difficult.
Eventually, the code has to be interpreted.
Crawling is already very expensive!
![Page 27: Taking over Search Engines](https://reader035.fdocuments.net/reader035/viewer/2022062501/56816793550346895ddccce3/html5/thumbnails/27.jpg)
Link analysis algorithms against web spamming
TrustRank
Anti-Trust Rank
Truncated PageRank
SpamRank
![Page 28: Taking over Search Engines](https://reader035.fdocuments.net/reader035/viewer/2022062501/56816793550346895ddccce3/html5/thumbnails/28.jpg)
Trust Rank
Observation
– Good pages tend to link to good pages.
– Humans are the best spam detectors.
Algorithm
– Select a small subset of pages and let a human classify them.
– Propagate the goodness of pages.
![Page 29: Taking over Search Engines](https://reader035.fdocuments.net/reader035/viewer/2022062501/56816793550346895ddccce3/html5/thumbnails/29.jpg)
Trust Rank: Selection
The seed set S should:
– be as small as possible
– cover a large part of the Web
Covering is related to out-links in the very same way PageRank is related to in-links
– Inverse PageRank!
A small number of pages with the highest Inverse PageRank are labeled by a human expert.
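Seed selection by Inverse PageRank can be sketched by running ordinary PageRank on the graph with every link reversed, then taking the top-scoring pages. The toy graph, the page names and the helper names `transpose` and `select_seeds` are hypothetical.

```python
def pagerank(links, alpha=0.85, iterations=200):
    """PageRank by power iteration; dangling pages link to every page."""
    pages = sorted(set(links) | {q for outs in links.values() for q in outs})
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        nxt = {p: (1 - alpha) / n for p in pages}
        for p in pages:
            targets = links.get(p) or pages  # dangling -> all pages
            for q in targets:
                nxt[q] += alpha * pr[p] / len(targets)
        pr = nxt
    return pr

def transpose(links):
    """Reverse every link, so that out-links become in-links."""
    rev = {}
    for p, outs in links.items():
        rev.setdefault(p, [])
        for q in outs:
            rev.setdefault(q, []).append(p)
    return rev

def select_seeds(links, k):
    """Return the k pages with the highest Inverse PageRank."""
    scores = pagerank(transpose(links))
    return sorted(scores, key=scores.get, reverse=True)[:k]

# 'portal' links out to many pages, so trust placed on it would spread widely.
web = {"portal": ["a", "b", "c", "d"], "a": ["e"]}
seeds = select_seeds(web, 2)
print(seeds)
```

The selected pages are only candidates: each one still has to be inspected by the human expert before it is used as a good seed.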
![Page 30: Taking over Search Engines](https://reader035.fdocuments.net/reader035/viewer/2022062501/56816793550346895ddccce3/html5/thumbnails/30.jpg)
Trust Rank: Propagation
Initial values
– TR(p) = 1, if p was found to be a good page
– TR(p) = 0, otherwise
Iterations:
– propagate trust in the same way as PageRank: splitting through out-links, damping (attenuation)
– only a fixed number of iterations M.
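The propagation step can be sketched as a biased PageRank whose personalization vector is concentrated on the good seeds. Several details are assumptions rather than something the slide states: the seed vector is normalized to sum to 1, α = 0.85, 20 iterations are run, and trust reaching a page without out-links is simply dropped.

```python
def trustrank(links, good_seeds, alpha=0.85, iterations=20):
    """Propagate trust from hand-labeled good pages along out-links."""
    pages = set(links) | {q for outs in links.values() for q in outs}
    # Seed vector: equal trust on every good seed, zero elsewhere.
    d = {p: (1.0 / len(good_seeds) if p in good_seeds else 0.0) for p in pages}
    trust = dict(d)
    for _ in range(iterations):
        nxt = {p: (1 - alpha) * d[p] for p in pages}
        for p, outs in links.items():
            for q in outs:
                # Splitting (divide by out-degree) and damping (times alpha).
                nxt[q] += alpha * trust[p] / len(outs)
        trust = nxt
    return trust

# Hypothetical graph: a good seed, two pages it reaches, and a spam pair.
web = {
    "good": ["p1"],
    "p1": ["p2"],
    "spam": ["farm"],
    "farm": ["spam"],
}
trust = trustrank(web, good_seeds={"good"})
print({p: round(t, 3) for p, t in sorted(trust.items())})
```

Trust decays with distance from the seed, and pages that no trusted page links to, directly or indirectly, stay at zero no matter how densely they link to each other.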
![Page 31: Taking over Search Engines](https://reader035.fdocuments.net/reader035/viewer/2022062501/56816793550346895ddccce3/html5/thumbnails/31.jpg)
Trust Rank: Results
![Page 32: Taking over Search Engines](https://reader035.fdocuments.net/reader035/viewer/2022062501/56816793550346895ddccce3/html5/thumbnails/32.jpg)
Anti-Trust Rank
Goal
– Find spam pages.
Algorithm
– Obtain a seed set of spam pages labeled by hand (prefer high PageRank).
– Compute the PageRank algorithm on the transposed adjacency matrix.
– Use the seed set in the personalization vector.
– Rank the pages in descending order of their scores.
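The steps above can be sketched as a personalized PageRank on the reversed graph. The graph and page names are hypothetical, and handing the rank of dangling pages back to the seed vector is an assumption of this sketch.

```python
def transpose(links):
    """Reverse every link of the graph."""
    rev = {}
    for p, outs in links.items():
        rev.setdefault(p, [])
        for q in outs:
            rev.setdefault(q, []).append(p)
    return rev

def personalized_pagerank(links, v, alpha=0.85, iterations=100):
    """PageRank with personalization vector v (a dict that sums to 1)."""
    pages = sorted(set(links) | {q for outs in links.values() for q in outs} | set(v))
    pr = {p: v.get(p, 0.0) for p in pages}
    for _ in range(iterations):
        # Rank sitting on pages without out-links goes back to the seeds.
        dangling = sum(pr[p] for p in pages if not links.get(p))
        nxt = {p: ((1 - alpha) + alpha * dangling) * v.get(p, 0.0) for p in pages}
        for p, outs in links.items():
            for q in outs:
                nxt[q] += alpha * pr[p] / len(outs)
        pr = nxt
    return pr

def anti_trust_rank(links, spam_seeds):
    """Return (pages sorted by descending anti-trust, score dict)."""
    v = {p: 1.0 / len(spam_seeds) for p in spam_seeds}
    scores = personalized_pagerank(transpose(links), v)
    return sorted(scores, key=scores.get, reverse=True), scores

# 'doorway' -> 'farm' -> 's' (known spam); 'honest' -> 'news' is unrelated.
web = {"doorway": ["farm"], "farm": ["s"], "s": [], "honest": ["news"]}
ranking, scores = anti_trust_rank(web, spam_seeds={"s"})
print(ranking[:3])
```

Because the graph is reversed, anti-trust flows from the spam seed to the pages that link to it, so pages that lead to known spam rank high while unrelated pages keep a score of zero.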
![Page 33: Taking over Search Engines](https://reader035.fdocuments.net/reader035/viewer/2022062501/56816793550346895ddccce3/html5/thumbnails/33.jpg)
Anti-Trust Rank
![Page 34: Taking over Search Engines](https://reader035.fdocuments.net/reader035/viewer/2022062501/56816793550346895ddccce3/html5/thumbnails/34.jpg)
Truncated Page Rank
Observation
– Good pages have high PageRank because of pages between 5 and 10 hops away.
– Spam pages gain PageRank because of pages in their neighborhood.
Solution
– Promote rank coming from far away.
– Demote rank coming from the closest pages.
![Page 37: Taking over Search Engines](https://reader035.fdocuments.net/reader035/viewer/2022062501/56816793550346895ddccce3/html5/thumbnails/37.jpg)
Truncated Page Rank
Rank propagates through links
– only a fraction $\alpha$ propagates at each step, according to the adjacency matrix $M$
– 5 steps of propagation mean $\alpha M \cdot \alpha M \cdot \alpha M \cdot \alpha M \cdot \alpha M = \alpha^5 M^5$
We can calculate the PageRank of a page by summing up the contributions from different distances:
– $PR(p) = \sum_t \alpha^t M^t = \sum_t \mathrm{damping}(t) \, M^t$
We can replace $\alpha^t$ with a different damping function.
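A sketch of the sum over distances, with the damping function passed in as a parameter. The truncated damping used here (zero for the first T steps, the usual geometric decay afterwards) is an assumption about the replacement function, as are the constant factor (1 - α), the choice T = 2, the toy graph, and the decision to drop rank that reaches a page without out-links.

```python
ALPHA = 0.85

def standard(t):
    """Usual PageRank damping: geometric decay with the distance t."""
    return (1 - ALPHA) * ALPHA ** t

def truncated(t, T=2):
    """Ignore contributions from the first T steps, keep the rest."""
    return 0.0 if t <= T else (1 - ALPHA) * ALPHA ** t

def damped_sum(links, damping, max_t=50):
    """Score = sum over t of damping(t) * (rank arriving after t link steps)."""
    pages = sorted(set(links) | {q for outs in links.values() for q in outs})
    v = {p: 1.0 / len(pages) for p in pages}  # distance 0: uniform start
    score = {p: 0.0 for p in pages}
    for t in range(max_t + 1):
        for p in pages:
            score[p] += damping(t) * v[p]
        nxt = {p: 0.0 for p in pages}
        for p, outs in links.items():
            for q in outs:
                nxt[q] += v[p] / len(outs)
        v = nxt
    return score

# 'spam' is boosted by five direct neighbors; 'good' is supported from afar.
web = {f"b{i}": ["spam"] for i in range(1, 6)}
web.update({"c1": ["c2"], "c2": ["c3"], "c3": ["c4"], "c4": ["good"]})

full = damped_sum(web, standard)
trunc = damped_sum(web, lambda t: truncated(t, T=2))
print(round(full["spam"], 4), round(full["good"], 4))
print(round(trunc["spam"], 4), round(trunc["good"], 4))
```

In this toy graph the link farm gives the spam page the higher ordinary score, but all of its support comes from distance 1, so its truncated score collapses to zero while the page supported from further away keeps part of its score.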
![Page 38: Taking over Search Engines](https://reader035.fdocuments.net/reader035/viewer/2022062501/56816793550346895ddccce3/html5/thumbnails/38.jpg)
Truncated Page Rank
Strategy:
– Pages whose PageRank is largely different from their Truncated PageRank are likely to be spam.
Results:
– Comparable with TrustRank.
![Page 39: Taking over Search Engines](https://reader035.fdocuments.net/reader035/viewer/2022062501/56816793550346895ddccce3/html5/thumbnails/39.jpg)
Spam Rank
Observations:
– Spam pages are usually supported by low-PageRank pages.
– Spammers have a limited budget, so they replicate only what they need for boosting PageRank.
Idea:
– Find missing statistical features of dishonest supporters.
– Due to self-similarity, the honest supporter set should have a power-law distribution of PageRank.
![Page 40: Taking over Search Engines](https://reader035.fdocuments.net/reader035/viewer/2022062501/56816793550346895ddccce3/html5/thumbnails/40.jpg)
Spam Rank: Algorithm
Find supporters for each page.
Check whether each set of supporters follows a power-law distribution of its PageRank.
Create penalties for suspicious pages.
Run PageRank using a personalization vector based on penalties.
SpamRank is a measure of undeserved PageRank.
![Page 41: Taking over Search Engines](https://reader035.fdocuments.net/reader035/viewer/2022062501/56816793550346895ddccce3/html5/thumbnails/41.jpg)
Spam Rank: Results
![Page 42: Taking over Search Engines](https://reader035.fdocuments.net/reader035/viewer/2022062501/56816793550346895ddccce3/html5/thumbnails/42.jpg)