
Improving Web Spam Classification using Rank-time Features

September 25, 2008

TaeSeob, Yun

KAIST DATABASE & MULTIMEDIA LAB


Contents

Introduction

Support Vector Machine

Data Set

Domain Separation

Rank-time features

Evaluation

Summary


Introduction

World Wide Web (WWW)

Definition
  An information space in which the items of interest, referred to as resources, are identified by global identifiers [IAN04]

Description
  Too much information is available
  Web search engines are needed


Introduction

Web Search Engine

Definition
  A search engine designed to search for information on the World Wide Web [WIK08]

Description
  Retrieves pages relevant to the user's query
  Ranking has become important
  Web spam interferes with Web search engines


Web Spam (1/2)

Definition
  A page that uses manipulative methods to improve its ranking [KRY07]

Objective
  Mislead Web search engines' ranking algorithms
  Make a profit by increasing the page's traffic

Why we should remove Web spam
  Users spend too much time searching for information
  Ranking on search engines is critical for making a profit
  Spam wastes search engines' resources


Web Spam (2/2)

Types of Web spam
  Link stuffing
  Keyword stuffing
  Cloaking
  Web farming

When to remove Web spam
  Crawl time
  Index time
  Rank time

How to remove Web spam
  By training a machine: Support Vector Machine (SVM)


Support Vector Machine (1/2)

Definition
  A set of related supervised learning methods used for classification and regression [WIK08]

Description
  Finds the separating hyperplane with maximal margin in a vector space
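For reference, the maximal-margin idea is usually written as the optimization problem below. This is the standard hard-margin formulation, not something shown on the slide; the symbols w (normal vector), b (offset), x_i (feature vector) and y_i (label, +1 or -1) are notation introduced here for illustration.

```latex
\min_{w,\,b}\ \frac{1}{2}\lVert w \rVert^{2}
\quad \text{subject to} \quad
y_i\,(w \cdot x_i + b) \ge 1, \qquad i = 1, \dots, n
```

Under these constraints the width of the margin is 2 / ||w||, so minimizing ||w|| maximizes the margin.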

[Figure: separating hyperplanes in 2 dimensions (axes v1, v2) and in 3 dimensions; what about n dimensions?]


Support Vector Machine (2/2)

Procedure
  Collect datasets
  Split the datasets into training datasets and a test dataset
  Train the machine with the training datasets
  Test the machine with the test dataset

Problem
  We need to collect datasets
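The train-then-test procedure above can be sketched in a few lines of Python. The slides do not say which SVM solver was used, so this sketch assumes a linear SVM trained by plain batch subgradient descent on the regularized hinge loss. The names train_linear_svm, predict, accuracy and the parameters lam, lr, epochs are hypothetical, and the data is a hand-made toy set, not the paper's.

```python
def train_linear_svm(samples, labels, lam=0.01, lr=0.1, epochs=200):
    """Batch subgradient descent on the regularized hinge loss.

    samples: list of feature tuples; labels: list of +1 / -1.
    Returns (w, b) describing the hyperplane w.x + b = 0.
    """
    dim = len(samples[0])
    n = len(samples)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]      # gradient of (lam/2) * ||w||^2
        grad_b = 0.0
        for x, y in zip(samples, labels):
            score = sum(wj * xj for wj, xj in zip(w, x)) + b
            if y * score < 1:                # sample violates the margin
                for j in range(dim):
                    grad_w[j] -= y * x[j] / n
                grad_b -= y / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(model, x):
    """Classify x as +1 or -1 by the side of the hyperplane it falls on."""
    w, b = model
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if score >= 0 else -1


def accuracy(model, samples, labels):
    """Fraction of samples whose predicted label matches the true label."""
    hits = sum(predict(model, x) == y for x, y in zip(samples, labels))
    return hits / len(samples)


# Toy, hand-made data (NOT from the paper): +1 = spam, -1 = non-spam.
train_x = [(2, 2), (3, 1), (1, 3), (2, 3), (-2, -2), (-3, -1), (-1, -3), (-2, -3)]
train_y = [1, 1, 1, 1, -1, -1, -1, -1]
test_x = [(2.5, 2.5), (-2.5, -1.5)]
test_y = [1, -1]

model = train_linear_svm(train_x, train_y)    # train on the training set
test_accuracy = accuracy(model, test_x, test_y)  # test on the held-out set
```

The test set is kept apart from the training set on purpose: accuracy measured on the training data alone says little about unseen pages.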


Dataset

Definition
  A set of labeled sample data for training and testing

Collecting procedure
  Collect common query lists from the MSN Live search engine
  Have human judges label each of the top-10 results as spam, non-spam, or unknown
  Split the dataset into training datasets and a test dataset

Splitting method for the datasets
  Very important!
  We choose Domain Separation


Domain Separation (1/6)

Definition
  A splitting method that groups URLs according to their domains

Procedure (in this paper)
  For each URL in the dataset
    Calculate a hash value from its domain
    If the hash value is new, assign it randomly to one of 5 files
    If the hash value has been seen before, put the URL into the file already assigned
  Adjust the 5 files to similar sizes
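A minimal sketch of the procedure above, under stated assumptions: the "domain" is taken to be the URL's host name, the hash is SHA-256, and the 5 files are in-memory lists. The slides specify none of these, and the names domain_of and domain_separate are hypothetical. The final step of adjusting the files to similar sizes is left out.

```python
import hashlib
import random
from urllib.parse import urlparse


def domain_of(url):
    """Return the lowercased host name of a URL (used here as its domain)."""
    return urlparse(url).netloc.lower()


def domain_separate(urls, n_folds=5, seed=0):
    """Assign each URL to one of n_folds so that a domain never spans two folds."""
    rng = random.Random(seed)
    fold_of_hash = {}                          # domain hash -> fold index
    folds = [[] for _ in range(n_folds)]
    for url in urls:
        key = hashlib.sha256(domain_of(url).encode("utf-8")).hexdigest()
        if key not in fold_of_hash:            # new hash: pick a fold at random
            fold_of_hash[key] = rng.randrange(n_folds)
        folds[fold_of_hash[key]].append(url)   # seen hash: reuse its fold
    return folds


# Made-up URLs for illustration only.
urls = [
    "http://spam.example/a",
    "http://spam.example/b",
    "http://good.example/",
    "http://SPAM.example/c",
    "http://other.example/x",
]
folds = domain_separate(urls)
```

Because the fold is looked up by domain hash, all pages of one domain end up together, which is the property the rest of the talk relies on.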

Why should we choose Domain Separation?


Domain Separation (2/6)

Domain-separated vs. randomly separated

Opinion
  Domain-separated datasets are better
  Results obtained by training on randomly separated datasets are WRONG!
  This is a general classification problem in machine learning

Reason
  If the dataset contains subsets that have their own features, we should use those features
  In fact, some spammers buy a domain just to make spam pages, so it is common for all pages from that domain to be labeled spam
  With random separation, pages from one spam domain can appear in both the training and test sets, which inflates the test results

How to make domain-separated datasets?


Domain Separation (3/6)

Five-fold cross validation

Definition
  The method used in this paper for training and testing the SVM

Procedure
  Choose one of the five domain-separated datasets as the test set
  Use the other four domain-separated datasets as training datasets
  Train the SVM with the 4 training datasets
  Test the SVM with the test set
  Repeat the steps above for every choice of test set
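The rotation of test and training sets can be sketched as a small generator. The name five_fold_splits and the placeholder fold contents are hypothetical; in the talk each fold would be one of the five domain-separated files.

```python
def five_fold_splits(folds):
    """Yield (training_items, test_items), holding out each fold once as the test set."""
    for i, test_fold in enumerate(folds):
        training = [item for j, fold in enumerate(folds) if j != i for item in fold]
        yield training, list(test_fold)


# Hypothetical placeholder data: five domain-separated folds of URL ids.
folds = [[1, 2], [3], [4, 5, 6], [7], [8, 9]]
splits = list(five_fold_splits(folds))
```

Each of the five splits would then be passed to the train and test steps, giving five test results to combine.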


Domain Separation (4/6)

The result of domain separation
  31,300 URLs in total
  3,133 URLs labeled spam (about 10%)

Problem
  Learning a mapping from feature vector to subset hash to label may turn out to be wildly and incorrectly optimistic
  Left as future work


Domain Separation (5/6)

Description
  No duplicated domains
  Consists of 25% spam
  Could not use domain information
  Worst-case graph


Domain Separation (6/6)

Description
  Adds an additional feature
  Consists of 10% spam
  More difficult to detect than with 25% spam

Result
  Still a little lower than randomly separated, but this is the worst case
  Note: still could not use domain information


FEATA (1/2)

Description
  Rank-independent features

FEATA includes
  Domain-level features
  Page-level features
  Link information


FEATA (2/2)

Description
  Average precision of 60% at 10.8% recall
  Dataset consists of 10% spam
  Not so good

We will add rank-time features!
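For readers less familiar with the two measures quoted above: precision is the share of pages flagged as spam that really are spam, and recall is the share of real spam pages that were flagged. A small sketch follows; the function name precision_recall and the labels are made up for illustration and are not the paper's data.

```python
def precision_recall(true_labels, predicted_labels, positive="spam"):
    """Return (precision, recall) for the positive class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(t == positive and p == positive for t, p in pairs)  # spam, flagged
    fp = sum(t != positive and p == positive for t, p in pairs)  # clean, flagged
    fn = sum(t == positive and p != positive for t, p in pairs)  # spam, missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up labels for illustration only.
truth = ["spam", "spam", "spam", "spam", "ham", "ham", "ham", "ham", "ham", "ham"]
guess = ["spam", "ham", "ham", "ham", "spam", "ham", "ham", "ham", "ham", "ham"]
precision, recall = precision_recall(truth, guess)
```

In this toy case one of the two flagged pages is spam (precision 0.5) and one of the four spam pages is caught (recall 0.25).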


Rank-time Features

Definition
  Features used at rank time

Motivation
  Every page has a feature vector
  The feature vectors of spam and non-spam pages have different shapes
  Spammers cannot guess the distribution of non-spam feature vectors

Consist of
  Query-independent features (FEATB)
  Query-dependent features (FEATQ)


FEATB

Definition
  Query-independent, rank-time features

Description
  Page-level features
  Domain-level features
  Popularity features
  Time features


FEATQ

Definition
  Query-dependent, rank-time features

Description
  Depend on the match between the query and document properties
  Examined for each returned result

Future work
  Spam is labeled on the URL only, not on the relevance of a URL to a query


Evaluation

Micro-averaged over the five tests
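One common reading of micro-averaging, assumed here because the slide does not spell it out, is to pool the raw counts from all five test folds and compute the measure once, instead of averaging five per-fold scores. The function name micro_average and the counts below are made up for illustration.

```python
def micro_average(fold_counts):
    """Pool TP/FP/FN over all folds, then compute precision and recall once."""
    tp = sum(c["tp"] for c in fold_counts)
    fp = sum(c["fp"] for c in fold_counts)
    fn = sum(c["fn"] for c in fold_counts)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up confusion counts for five test folds (illustration only).
fold_counts = [
    {"tp": 8, "fp": 2, "fn": 4},
    {"tp": 6, "fp": 4, "fn": 6},
    {"tp": 9, "fp": 1, "fn": 3},
    {"tp": 7, "fp": 3, "fn": 5},
    {"tp": 10, "fp": 0, "fn": 2},
]
micro_precision, micro_recall = micro_average(fold_counts)
```

Pooling the counts weights every test URL equally, so a larger fold has more influence on the result than a smaller one.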


Summary

Classification of Web Spam is an important problem

We can classify Web Spam by training an SVM

Building the training datasets as domain-separated datasets is very important

Rank-time features improve classification performance by as much as 25% in recall at a set precision


References

[KRY07] Svore, K. M., Wu, Q., Burges, C. J. C., Raman, A., “Improving Web Spam Classification using Rank-time Features”, AIRWeb ’07, May 8, 2007

[IAN04] Jacobs, I., Walsh, N. (eds.), “Architecture of the World Wide Web, Volume One”, W3C Recommendation, Dec 15, 2004

[WIK08] “Web Search Engine”, “Support Vector Machine”, http://wikipedia.org, Sep 25, 2008


[Appendix A] Receiver Operating Characteristic