Improving Web Spam Classification using Rank-time Features
September 25, 2008
TaeSeob Yun, KAIST
DATABASE & MULTIMEDIA LAB
Contents
Introduction
Support Vector Machine
Data Set
Domain Separation
Rank-time features
Evaluation
Summary
Introduction
World Wide Web (WWW)
Definition
  An information space in which the items of interest, referred to as resources, are identified by global identifiers [IAN04]
Description
  The Web holds too much information to browse by hand
  Web search engines are needed
Introduction
Web Search Engine
Definition
  A search engine designed to search for information on the World Wide Web [WIK08]
Description
  Retrieves pages relevant to a user's query
  Ranking has become important
  Web spam interferes with web search engines
Web Spam(1/2)
Definition
  A page that uses illegitimate methods to improve its ranking [KRY07]
Objective
  Mislead the ranking algorithms of web search engines
  Make a profit by increasing a page's traffic
Why we should remove web spam
  Users spend too much time searching for information
  Ranking on search engines is critical for making a profit
  Removing spam saves search engine resources
Web Spam(2/2)
Types of web spam
  Link stuffing
  Keyword stuffing
  Cloaking
  Web farming
When to remove web spam
  Crawl-time
  Index-time
  Rank-time
How to remove web spam
  By training a machine: Support Vector Machine (SVM)
Support Vector Machine(1/2)
Definition
  A set of related supervised learning methods used for classification and regression [WIK08]
Description
  Finds the separating hyperplane with the maximal margin in a vector space
[Figure: separating hyperplane in 2 dimensions and in 3 dimensions (labels v1, v2); what about n dimensions?]
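The maximal-margin idea is usually written as the optimization problem below. This is the standard hard-margin formulation for linearly separable data, not a formula taken from the slides; the symbols $w$, $b$, $x_i$, $y_i$ and $n$ are notation introduced here for illustration.

```latex
\begin{aligned}
\min_{w,\,b}\quad & \tfrac{1}{2}\,\lVert w \rVert^{2} \\
\text{subject to}\quad & y_i\,(w \cdot x_i + b) \ge 1, \qquad i = 1, \dots, n
\end{aligned}
```

Here $x_i$ is the feature vector of page $i$, $y_i \in \{-1, +1\}$ is its label, and the separating hyperplane is $w \cdot x + b = 0$. The margin has width $2 / \lVert w \rVert$, so minimizing $\lVert w \rVert$ maximizes the margin. Nothing in the formulation depends on the number of dimensions, which is why the 2- and 3-dimensional pictures carry over to n dimensions.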
Support Vector Machine(2/2)
Procedure
  Collect datasets
  Split the datasets into training datasets and a test dataset
  Train the machine with the training datasets
  Test the machine with the test dataset
Problem
  We need to collect datasets
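The train-then-test procedure above can be sketched in a few lines of plain Python. This is a minimal illustration and not the classifier used in the paper: it trains a linear SVM by stochastic subgradient descent on the hinge loss, omits the bias term for brevity (so the hyperplane passes through the origin), and uses made-up two-dimensional toy data. The function names, feature values, and hyperparameters are all assumptions.

```python
import random


def train_linear_svm(xs, ys, epochs=50, lr=0.1, lam=0.01, seed=0):
    """Fit weights w by subgradient descent on the L2-regularized hinge loss."""
    rng = random.Random(seed)
    w = [0.0] * len(xs[0])
    order = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            margin = ys[i] * sum(wj * xj for wj, xj in zip(w, xs[i]))
            # L2 regularization shrinks w a little on every step.
            w = [wj * (1.0 - lr * lam) for wj in w]
            if margin < 1.0:
                # Hinge loss is active: push w toward classifying xs[i] with margin.
                w = [wj + lr * ys[i] * xj for wj, xj in zip(w, xs[i])]
    return w


def predict(w, x):
    """Return +1 (spam) or -1 (non-spam) depending on the side of the hyperplane."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) >= 0 else -1


def accuracy(w, xs, ys):
    hits = sum(1 for x, y in zip(xs, ys) if predict(w, x) == y)
    return hits / len(xs)


# Hypothetical toy data: +1 = spam, -1 = non-spam (feature values are invented).
train_x = [(2.0, 1.5), (1.5, 2.5), (3.0, 2.0), (2.5, 3.0),
           (-2.0, -1.5), (-1.5, -2.5), (-3.0, -2.0), (-2.5, -3.0)]
train_y = [1, 1, 1, 1, -1, -1, -1, -1]
test_x = [(3.0, 3.0), (2.5, 2.0), (-3.0, -3.0), (-2.0, -2.5)]
test_y = [1, 1, -1, -1]

w = train_linear_svm(train_x, train_y)      # train on the training dataset
test_accuracy = accuracy(w, test_x, test_y)  # test on the held-out test dataset
print("test accuracy:", test_accuracy)
```

The key point is that the test dataset is never seen during training; how the data is split into the two parts is what the following slides are about.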
Dataset
Definition
  A set of labeled sample data for training and testing
Collecting procedure
  Collect common query lists from the MSN Live search engine
  Have human judges label each of the top-10 results as spam, non-spam, or unknown
  Split the dataset into training datasets and a test dataset
Method for splitting the dataset
  Very important!
  We choose domain separation
Domain Separation(1/6)
Definition
  A method that splits the dataset according to domains
Procedure (in this paper)
  For each URL in the dataset
    Calculate a hash value from its domain
    If the hash value is new, assign it randomly to one of 5 files
    If the hash value has been seen before, put the URL into the file already assigned
  Adjust the 5 files to similar sizes
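The procedure above can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's implementation: the URL's host stands in for its domain, MD5 is an arbitrary choice of hash, and instead of assigning new domains randomly and rebalancing afterwards, each new domain goes straight to the currently smallest fold. The names `domain_hash` and `domain_separate` and the example URLs are hypothetical.

```python
import hashlib
from urllib.parse import urlparse

NUM_FOLDS = 5


def domain_hash(url):
    """Hash the host part of a URL so that all pages of one host share a key."""
    host = urlparse(url).netloc.lower()
    return hashlib.md5(host.encode("utf-8")).hexdigest()


def domain_separate(urls, num_folds=NUM_FOLDS):
    """Split URLs into folds such that no domain appears in more than one fold."""
    folds = [[] for _ in range(num_folds)]
    assigned = {}  # domain hash -> index of the fold it was assigned to
    for url in urls:
        key = domain_hash(url)
        if key not in assigned:
            # Assumption: new domains go to the smallest fold to keep sizes similar.
            assigned[key] = min(range(num_folds), key=lambda i: len(folds[i]))
        folds[assigned[key]].append(url)
    return folds


# Hypothetical URLs for illustration only.
urls = [
    "http://spam-a.example/1",
    "http://spam-a.example/2",
    "http://news.example/",
    "http://shop.example/item",
    "http://blog.example/post",
    "http://wiki.example/page",
    "http://spam-a.example/3",
]
folds = domain_separate(urls)
for index, fold in enumerate(folds):
    print(index, fold)
```

Because the fold is looked up by domain hash, every page of a spam domain ends up on the same side of any train/test split built from these folds.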
Why should we choose Domain Separation?
Domain Separation(2/6)
Domain-separated vs. randomly separated
Opinion
  Domain-separated datasets are better
  Results obtained by training on a randomly separated dataset are WRONG!
  This is a general classification problem in machine learning
Reason
  If the dataset contains subsets that share features, the split should take those subsets into account
  In fact, some spammers buy a domain just to host spam pages, so it is common for all pages in that domain to be labeled spam
How do we make domain-separated datasets?
Domain Separation(3/6)
Five-fold cross validation
Definition
  The method used in this paper for training and testing the SVM
Procedure
  Choose one of the five domain-separated datasets as the test set
  Use the other four domain-separated datasets as training datasets
  Train the SVM with the 4 training datasets
  Test the SVM with the test set
  Repeat the above steps so that each dataset serves as the test set once
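The rotation of test and training sets can be sketched as a small generator. This is an illustration only; `five_fold_splits` and the placeholder items are hypothetical, and the folds are assumed to be already domain-separated.

```python
def five_fold_splits(folds):
    """Yield (training_items, test_items), using each fold as the test set once."""
    for test_index, test_fold in enumerate(folds):
        training = [
            item
            for index, fold in enumerate(folds)
            if index != test_index
            for item in fold
        ]
        yield training, list(test_fold)


# Placeholder items standing in for labeled URLs in five domain-separated folds.
example_folds = [["a1", "a2"], ["b1"], ["c1", "c2"], ["d1"], ["e1"]]
splits = list(five_fold_splits(example_folds))
for training, test in splits:
    print("train on", training, "test on", test)
```

Each round would train a fresh SVM on `training` and evaluate it on `test`, giving five test results in total.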
Domain Separation(4/6)
Result of domain separation
  31,300 URLs in total
  3,133 URLs labeled spam (about 10%)
Problem
  Learning a mapping from feature vector to subset hash to label may give wildly and incorrectly optimistic results
  Left as future work
Domain Separation(5/6)
Description
  No duplicated domains
  Consists of 25% spam
  Could not use domain information
  Worst-case graph
Domain Separation(6/6)
Description
  Adds additional features
  Consists of 10% spam
  More difficult to detect than with 25% spam
Result
  Still a little lower than randomly separated, but this is the worst case
  Note: still could not use domain information
FEATA(1/2)
Description
  Rank-independent features
FEATA includes
  Domain-level features
  Page-level features
  Link information
FEATA(2/2)
Description
  Average precision of 60% at 10.8% recall
  Dataset consists of 10% spam
  Not so good
We will add rank-time features!
Rank-time Features
Definition
  Features used at rank-time
Motivation
  Every page has a feature vector
  The feature vectors of spam and non-spam pages have different shapes
  A spammer cannot guess the distribution of non-spam feature vectors
Consist of
  Query-independent features (FEATB)
  Query-dependent features (FEATQ)
FEATB
Definition
  Query-independent, rank-time features
Description
  Page-level features
  Domain-level features
  Popularity features
  Time features
FEATQ
Definition
  Query-dependent, rank-time features
Description
  Depend on the match between the query and document properties
  Examined for each returned result
Future work
  Label spam on the URL only, not on the relevance of a URL to a query
Evaluation
Results are micro-averaged over the five tests
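Micro-averaging pools the raw counts from all five tests before computing a score, rather than averaging five per-test scores. The sketch below shows this for precision and recall; the function name and the per-test counts are invented for illustration and are not results from the paper.

```python
def micro_average(fold_counts):
    """Pool true positives, false positives and false negatives over all tests."""
    tp = sum(counts["tp"] for counts in fold_counts)
    fp = sum(counts["fp"] for counts in fold_counts)
    fn = sum(counts["fn"] for counts in fold_counts)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up spam-detection counts for five test folds (not data from the paper).
fold_counts = [
    {"tp": 6, "fp": 2, "fn": 14},
    {"tp": 4, "fp": 1, "fn": 16},
    {"tp": 5, "fp": 3, "fn": 15},
    {"tp": 7, "fp": 2, "fn": 13},
    {"tp": 8, "fp": 2, "fn": 12},
]
precision, recall = micro_average(fold_counts)
print("precision:", precision, "recall:", recall)
```

With pooled counts, a fold containing more spam pages contributes proportionally more to the final score, which suits folds that are only approximately equal in size.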
Summary
Classification of Web Spam is an important problem
We can classify Web Spam by training an SVM
Building the training datasets as domain-separated datasets is very important
Rank-time features improve classification performance by as much as 25% in recall at a set precision
References
[KRY07] Svore, K., Wu, Q., Burges, C., Raman, A., “Improving Web Spam Classification using Rank-time Features”, AIRWeb ’07, May 8, 2007
[IAN04] Jacobs, I., Walsh, N. (eds.), “Architecture of the World Wide Web, Volume One”, W3C Recommendation, Dec 15, 2004
[WIK08] “Web Search Engine”, “Support Vector Machine”, http://wikipedia.org, Sep 25, 2008
[Appendix A] Receiver Operating Characteristic