Topical TrustRank: Using Topicality to Combat Web Spam
Baoning Wu, Vinay Goel, Prof. Brian D. Davison
CSE 450 Term Project
Introduction
Problem of spam
No universal technique to combat all types of spam
TrustRank introduced notion of trust to demote spam pages
We improve on TrustRank using topical information - Topical TrustRank
TrustRank
Link between two pages signifies trust between them
Initially, human experts select a list of seed sites that are well known and trustworthy
A biased PageRank algorithm is used
Spam sites will have poor trust scores
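The biased PageRank step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `trust_rank`, the damping value `alpha=0.85`, the fixed iteration count, and the toy graph are all assumptions, and trust held by a page without out-links is simply dropped.

```python
def trust_rank(out_links, seeds, alpha=0.85, iterations=50):
    """Biased PageRank: random jumps land only on the trusted seed pages."""
    nodes = set(out_links)
    for targets in out_links.values():
        nodes.update(targets)
    # Jump distribution: uniform over the seeds, zero everywhere else (assumption).
    jump = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    score = dict(jump)
    for _ in range(iterations):
        nxt = {n: (1.0 - alpha) * jump[n] for n in nodes}
        for src, targets in out_links.items():
            if targets:
                # A page splits its (damped) trust evenly among the pages it links to.
                share = alpha * score[src] / len(targets)
                for dst in targets:
                    nxt[dst] += share
        score = nxt
    return score


# Toy graph: one trusted pair of pages and an isolated two-page spam farm.
graph = {
    "seed": ["good"],
    "good": ["seed"],
    "spam1": ["spam2"],
    "spam2": ["spam1"],
}
scores = trust_rank(graph, seeds={"seed"})
```

In this toy graph no path leads from the seed to the spam farm, so the spam pages end with a trust score of zero while `good` inherits trust from `seed`.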
Issues with TrustRank
Coverage of the seed set may not be broad enough
many different topics exist, each with good pages
TrustRank has a bias towards communities that are heavily represented in the seed set
may inadvertently help spammers who fool these communities
Explanation
Suggestions
Propose the use of pages listed in well-maintained topic directories as seed pages
Trustworthiness of a page should be differentiated by topics
A link between two pages is usually created in a topic-specific context
Topical TrustRank
Partition the seed set into topically coherent groups
TrustRank is calculated for each topic
Final ranking is generated by a balanced combination of these topic-specific trust scores
Generalized Technique
Partition the seed set
Compute TrustRank for each partition
Combine the trust scores of each partition
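The three steps above can be sketched end to end. This is an illustrative sketch under stated assumptions: `biased_pagerank` and `partitioned_trust_rank` are hypothetical names, each partition's jump vector is uniform over its own seeds, the combination shown is plain summation, and the graph is a toy example.

```python
def biased_pagerank(out_links, teleport, alpha=0.85, iterations=50):
    """Propagate score from the pages named in `teleport` along out-links."""
    nodes = set(out_links) | set(teleport)
    for targets in out_links.values():
        nodes.update(targets)
    jump = {n: teleport.get(n, 0.0) for n in nodes}
    score = dict(jump)
    for _ in range(iterations):
        nxt = {n: (1.0 - alpha) * jump[n] for n in nodes}
        for src, targets in out_links.items():
            if targets:
                share = alpha * score[src] / len(targets)
                for dst in targets:
                    nxt[dst] += share
        score = nxt
    return score


def partitioned_trust_rank(out_links, partitions):
    """partitions: hypothetical dict mapping a partition label to its seed pages."""
    # Step 2: one trust computation per partition of the seed set.
    per_partition = {}
    for label, seeds in partitions.items():
        teleport = {page: 1.0 / len(seeds) for page in seeds}
        per_partition[label] = biased_pagerank(out_links, teleport)
    # Step 3: combine the per-partition trust scores by simple summation.
    combined = {}
    for scores in per_partition.values():
        for page, value in scores.items():
            combined[page] = combined.get(page, 0.0) + value
    return per_partition, combined


graph = {
    "arts_seed": ["a"],
    "sports_seed": ["b"],
    "a": ["b"],
    "b": [],
    "spam": ["spam"],
}
# Step 1: the seed set, already partitioned (here by topic).
partitions = {"Arts": ["arts_seed"], "Sports": ["sports_seed"]}
per_partition, combined = partitioned_trust_rank(graph, partitions)
```

Page `b` is reachable from both partitions and so collects trust from each, while the self-linking `spam` page is reachable from neither and stays at zero.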
Partitioning
Random
By topic
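Both partitioning options can be sketched in a few lines. The helper names, the `rng_seed` parameter, and the example hosts and topic labels are assumptions made for illustration only.

```python
import random
from collections import defaultdict


def partition_by_topic(seed_topics):
    """Group seed pages by the topic label they carry in the directory."""
    groups = defaultdict(list)
    for page, topic in seed_topics.items():
        groups[topic].append(page)
    return dict(groups)


def partition_randomly(seeds, num_groups, rng_seed=0):
    """Split the seeds into num_groups groups with no regard to topic."""
    shuffled = sorted(seeds)
    random.Random(rng_seed).shuffle(shuffled)
    # Deal pages round-robin so group sizes differ by at most one.
    return {i: shuffled[i::num_groups] for i in range(num_groups)}


# Hypothetical seed pages with their directory topics.
seed_topics = {
    "louvre.example": "Arts",
    "moma.example": "Arts",
    "fifa.example": "Sports",
    "nba.example": "Sports",
    "who.example": "Health",
}
by_topic = partition_by_topic(seed_topics)
at_random = partition_randomly(seed_topics, num_groups=2)
```

Random partitioning serves as a baseline: it yields the same number of trust computations as topical partitioning, but the groups are not topically coherent.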
Combination of trust scores
Simple summation
Quality bias
each topic weighted by a bias factor
summation of these weighted topic scores
one such bias: Average PageRank value of the seed pages of the topic
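A small sketch of the quality-bias combination, using the average PageRank of each topic's seed pages as the bias factor. The function name and all numeric values are made up for illustration; the per-topic trust scores are assumed to have been computed already.

```python
def combine_with_quality_bias(per_topic_scores, topic_seeds, pagerank):
    """Weight each topic's trust scores by the mean PageRank of its seeds, then sum."""
    weights = {
        topic: sum(pagerank[s] for s in seeds) / len(seeds)
        for topic, seeds in topic_seeds.items()
    }
    combined = {}
    for topic, scores in per_topic_scores.items():
        for page, value in scores.items():
            combined[page] = combined.get(page, 0.0) + weights[topic] * value
    return weights, combined


# Hypothetical inputs: PageRank of the seeds and trust scores of two pages per topic.
pagerank = {"s1": 0.4, "s2": 0.2, "s3": 0.1}
topic_seeds = {"Arts": ["s1", "s2"], "Sports": ["s3"]}
per_topic_scores = {
    "Arts": {"p": 0.5, "q": 0.0},
    "Sports": {"p": 0.1, "q": 0.9},
}
weights, combined = combine_with_quality_bias(per_topic_scores, topic_seeds, pagerank)
```

Here the Arts seeds have the higher average PageRank, so page `p`, trusted mainly by Arts, ends up above page `q` even though `q` has the larger raw score within Sports.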
Improvements
Seed Weighting
Seed Filtering
Finer topics hierarchy
Seed Weighting
Instead of assigning an equal weight to each seed page,
assign a weight proportional to its quality / importance
use the normalized PageRank value of each seed page within the seed set
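As a sketch, seed weighting replaces the uniform jump vector with one proportional to PageRank. The function name is hypothetical, and "normalized within the seed set" is read here as dividing each seed's PageRank by the sum over all seeds, which is an assumption.

```python
def weighted_seed_vector(seeds, pagerank):
    """Jump probability of each seed, proportional to its PageRank among the seeds."""
    total = sum(pagerank[s] for s in seeds)
    return {s: pagerank[s] / total for s in seeds}


# Hypothetical PageRank values; "c" is not a seed and is ignored.
pagerank = {"a": 0.03, "b": 0.01, "c": 0.5}
jump = weighted_seed_vector(["a", "b"], pagerank)
```

With these values seed `a` receives three quarters of the jump probability and seed `b` one quarter, instead of one half each.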
Seed Filtering
Low quality pages may exist in topic directories
Need to filter out these pages
Use PageRank / TrustRank / Topical TrustRank for filtering
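One way to sketch seed filtering is to keep only the best-scoring share of the candidate seeds. The slides do not give a cutoff rule, so the `keep_fraction` parameter and the function name are assumptions; `quality` stands for whichever score (PageRank, TrustRank or Topical TrustRank) is used for filtering.

```python
def filter_seeds(seeds, quality, keep_fraction=0.5):
    """Keep the top share of seeds by quality score; at least one seed survives."""
    ranked = sorted(seeds, key=lambda s: quality.get(s, 0.0), reverse=True)
    keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:keep]


# Hypothetical quality scores for four directory pages.
quality = {"a": 0.4, "b": 0.05, "c": 0.3, "d": 0.01}
filtered = filter_seeds(["a", "b", "c", "d"], quality, keep_fraction=0.5)
```

The low-scoring pages `b` and `d` are dropped, and the remaining seeds would then be used for the trust computation.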
Finer Topics Hierarchy
Most researchers use only top level topics
A finer topic hierarchy may categorize pages on the web more accurately
In Topical TrustRank, this has the effect of producing better partitions
Data sets
• 20M pages from search.ch company
  • 35K sites
  • 3,589 labeled spam sites
  • dir.search.ch
• WebBase data for Jan, 2001
  • 65M pages
  • DMOZ RDF Jan, 2001
Initial comparison
Ranking
• Each page has three rankings:
  • PageRank, TrustRank and Topical TrustRank
• Pages are put into 20 buckets
  • The scores of the pages within each bucket sum to 5% of the total
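The bucketing can be sketched as a walk down the ranking that starts a new bucket each time another equal share of the total score has been used up. The function name and the toy scores are assumptions, and the example uses 4 buckets of 25% each to stay small.

```python
def assign_buckets(scores, num_buckets=20):
    """Map each page to a bucket (1 = top) so buckets hold roughly equal score mass."""
    total = sum(scores.values())
    ranked = sorted(scores, key=scores.get, reverse=True)
    buckets = {}
    cumulative = 0.0
    for page in ranked:
        # The bucket is decided by the score mass accumulated before this page.
        index = min(int(cumulative / total * num_buckets), num_buckets - 1)
        buckets[page] = index + 1
        cumulative += scores[page]
    return buckets


# Toy scores summing to 100, split into 4 buckets of 25 each.
scores = {"a": 25, "b": 25, "c": 15, "d": 10, "e": 10, "f": 10, "g": 5}
buckets = assign_buckets(scores, num_buckets=4)
```

Because scores are heavily skewed in practice, the top buckets contain few pages and the bottom buckets contain many, which is why the number of spam pages in the top buckets is a meaningful measure.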
Metrics
• Number of spam pages within top buckets
• Overall movement
  • The sum of movement for each spam page
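Both metrics can be sketched from bucket assignments. The slides do not define movement precisely, so this sketch assumes it is the signed change in bucket number relative to a baseline ranking such as PageRank, with a positive value meaning demotion; the function names and the bucket numbers are made up.

```python
def spam_in_top_buckets(buckets, spam, top=10):
    """Count the spam pages that fall within the first `top` buckets."""
    return sum(1 for page in spam if buckets[page] <= top)


def overall_movement(baseline_buckets, new_buckets, spam):
    """Sum over spam pages of (new bucket - baseline bucket); positive means demoted."""
    return sum(new_buckets[page] - baseline_buckets[page] for page in spam)


# Hypothetical bucket numbers under a baseline ranking and a trust-based ranking.
baseline = {"s1": 2, "s2": 9, "g": 1}
trusted = {"s1": 12, "s2": 8, "g": 1}
spam = ["s1", "s2"]

before = spam_in_top_buckets(baseline, spam)
after = spam_in_top_buckets(trusted, spam)
movement = overall_movement(baseline, trusted, spam)
```

A lower count in the top buckets and a larger overall movement both indicate that the ranking pushes spam further down.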
Basic results on search.ch data
Algorithm           No. within top 10 buckets   Overall movement
PageRank            90                          -
TrustRank           58                          4,537
Topical TrustRank   42                          4,620
Improvement
Method              No. within top 10 buckets   Overall movement
Seed weighting      37                          4,548
Seed filtering      42                          4,671
Quality bias        40                          4,620
Two-level topic     37                          4,604
Combination         33                          4,617
Spam sites
Result for WebBase data
• For pages demoted by TrustRank, the spam ratio is 20.2%.
• For pages demoted by Topical TrustRank, the spam ratio is 30.4%.
• For combination of ideas, the spam ratio is 32.9%.
Distribution of the 133 spam pages for WebBase data set
Conclusion
• Effective approach to demote spam
• Use of topical information
• Future work
  • explore partitioning strategies
  • lessons learned may be applied to Personalized Searching techniques
  • better techniques to combine trust scores