Web Data Mining
Web Mining
Web Mining is the use of data mining techniques to automatically discover and extract information from web documents and services
Discovering useful information from the World-Wide Web and its usage patterns
Using data mining techniques to make the web more useful and more profitable (for some) and to increase the efficiency of our interaction with the web
Web Mining
Data Mining Techniques
– Association rules
– Sequential patterns
– Classification
– Clustering
– Outlier discovery
Applications to the Web
– E-commerce
– Information retrieval (search)
– Network management
Examples of Discovered Patterns
Association rules: 98% of AOL users also have E-trade accounts
Classification: people with age less than 40 and salary > 40k trade on-line
Clustering: users A and B access similar URLs
Outlier detection: user A spends more than twice the average amount of time surfing on the Web
Web Mining
The WWW is a huge, widely distributed, global information service centre for
– Information services: news, advertisements, consumer information, financial management, education, government, e-commerce, etc.
– Hyper-link information
– Access and usage information
The WWW provides rich sources of data for data mining
Why Mine the Web?
Enormous wealth of information on the Web
– Financial information (e.g. stock quotes)
– Book/CD/video stores (e.g. Amazon)
– Restaurant information (e.g. Zagats)
– Car prices (e.g. Carpoint)
Lots of data on user access patterns
– Web logs contain the sequences of URLs accessed by users
Possible to mine interesting nuggets of information
– People who ski also travel frequently to Europe
– Tech stocks have corrections in the summer and rally from November until February
Why is Web Mining Different?
The Web is not just a huge collection of documents; it also carries
– Hyper-link information
– Access and usage information
The Web is very dynamic
– New pages are constantly being generated
Challenge: develop new Web mining algorithms and adapt traditional data mining algorithms to
– Exploit hyper-links and access patterns
– Be incremental
Web Mining Applications
E-commerce (infrastructure)
– Generate user profiles
– Targeted advertising
– Fraud
– Similar image retrieval
Information retrieval (search) on the Web
– Automated generation of topic hierarchies
– Web knowledge bases
– Extraction of schema for XML documents
Network management
– Performance management
– Fault management
User Profiling
Important for improving customization
– Provide users with pages and advertisements of interest
– Example profiles: on-line trader, on-line shopper
Generate user profiles based on their access patterns (a sketch follows this list)
– Cluster users based on frequently accessed URLs
– Use a classifier to generate a profile for each cluster
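As an illustration of the cluster-then-classify idea, here is a minimal Python sketch; the URL access counts, the choice of k = 2, and the profile names are invented for the example, and the slides do not prescribe a particular clustering algorithm.

import numpy as np
from sklearn.cluster import KMeans

# Rows = users, columns = URLs; entries = access counts taken from web logs.
access_counts = np.array([
    [9, 8, 0, 0],   # mostly trading URLs
    [7, 9, 1, 0],
    [0, 1, 8, 9],   # mostly shopping URLs
    [1, 0, 9, 7],
])

# Step 1: cluster users on their frequently accessed URLs.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(access_counts)
print(labels)  # e.g. [0 0 1 1]: cluster 0 = "on-line trader", cluster 1 = "on-line shopper"

# Step 2 (per the slide): train a classifier on each cluster's members
# to produce a reusable profile for assigning new users.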
Engage Technologies
– Tracks web traffic to create anonymous user profiles of Web surfers
– Has profiles for more than 35 million anonymous users
Internet Advertising
Ads are a major source of revenue for Web portals (e.g., Yahoo, Lycos) and E-commerce sites
Plenty of startups doing internet advertising
– DoubleClick, AdForce, Flycast, AdKnowledge
Internet advertising is probably the "hottest" web mining application today
Scheme 1:
– Manually associate a set of ads with each user profile
– For each user, display an ad from the set based on profile
Scheme 2 (sketched below):
– Automate the association between ads and users
– Use ad click information to cluster users (each user is associated with the set of ads that he/she clicked on)
– For each cluster, find the ads that occur most frequently in the cluster; these become the ads for the set of users in the cluster
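A rough Python sketch of Scheme 2's final step; the click data and cluster assignments are invented, and a real system would first produce the clusters from the click matrix.

from collections import Counter

# clicks[user] = set of ads that user clicked on (invented example data)
clicks = {
    "U1": {"A1", "A2"}, "U2": {"A1", "A3"},   # assumed members of cluster 0
    "U3": {"A4", "A5"}, "U4": {"A4"},         # assumed members of cluster 1
}
clusters = {0: ["U1", "U2"], 1: ["U3", "U4"]}

# For each cluster, the most frequently clicked ads become the ads
# displayed to every user in that cluster.
for cid, users in clusters.items():
    freq = Counter(ad for u in users for ad in clicks[u])
    print(cid, [ad for ad, _ in freq.most_common(2)])  # e.g. 0 ['A1', 'A2']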
Internet Advertising
Use collaborative filtering (e.g. Likeminds, Firefly)
– Each user Ui has a rating for a subset of ads (based on click information, time spent, items bought, etc.)
– Rij: the rating of user Ui for ad Aj
Problem: compute user Ui's rating for an unrated ad Aj
[Figure: a user-by-ad ratings matrix with columns A1 A2 A3 and a '?' marking the unrated entry]

Internet Advertising
Key idea: user Ui's rating for ad Aj is set to Rkj, where Uk is the user whose rating of ads is most similar to Ui's
User Ui's rating for an ad Aj that has not been previously displayed to Ui is computed as follows (see the sketch below):
– Consider a user Uk who has rated ad Aj
– Compute Dik, the distance between Ui's and Uk's ratings on common ads
– Ui's rating for ad Aj = Rkj (Uk is the user with the smallest Dik)
– Display to Ui the ad Aj with the highest computed rating
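A minimal sketch of this nearest-neighbour computation in Python; the users, ads, and rating values are invented, and a real system would derive Rij from clicks, time spent, and purchases as the slides describe.

# ratings[user][ad] = Rij, the rating of user Ui for ad Aj
ratings = {
    "U1": {"A1": 5, "A2": 1},
    "U2": {"A1": 4, "A2": 2, "A3": 5},
    "U3": {"A1": 1, "A3": 2},
}

def distance(u, k):
    """Average L1 distance between two users' ratings on their common ads (Dik)."""
    common = ratings[u].keys() & ratings[k].keys()
    if not common:
        return float("inf")
    return sum(abs(ratings[u][ad] - ratings[k][ad]) for ad in common) / len(common)

def predict(u, ad):
    """Ui's rating for unrated ad Aj = Rkj, where Uk has rated Aj and minimizes Dik."""
    candidates = [k for k in ratings if k != u and ad in ratings[k]]
    if not candidates:
        return None
    nearest = min(candidates, key=lambda k: distance(u, k))
    return ratings[nearest][ad]

print(predict("U1", "A3"))  # 5: U2 is closest to U1 and rated A3 with 5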
Fraud
With the growing popularity of e-commerce, systems to detect and prevent fraud on the Web become important
Maintain a signature for each user based on buying patterns on the Web (e.g., amount spent, categories of items bought)
If the buying pattern changes significantly, then signal fraud (a toy sketch follows)
HNC Software uses domain knowledge and neural networks for credit card fraud detection
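A toy version of the signature idea in Python; the fields, the z-score test, and the threshold are assumptions for illustration, not details given in the slides.

# A user's signature: running statistics of their buying pattern.
signature = {"mean_spend": 40.0, "std_spend": 10.0,
             "categories": {"books", "music"}}

def looks_fraudulent(purchase, sig, z_threshold=3.0):
    """Signal fraud when a purchase deviates sharply from the stored signature."""
    z = abs(purchase["amount"] - sig["mean_spend"]) / sig["std_spend"]
    unfamiliar = purchase["category"] not in sig["categories"]
    return z > z_threshold or unfamiliar

print(looks_fraudulent({"amount": 500.0, "category": "jewelry"}, signature))  # True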
Retrieval of Similar Images
Given:
– A set of images
Find:
– All images similar to a given image
– All pairs of similar images
Sample applications:
– Medical diagnosis
– Weather prediction
– Web search engines for images
– E-commerce
QBIC, Virage, and Photobook compute a feature signature for each image
– QBIC uses color histograms
– WBIIS, WALRUS use wavelets
Use a spatial index to retrieve the database image whose signature is closest to the query's signature
WALRUS decomposes an image into regions
– A single signature is stored for each region
– Two images are considered similar if they have enough similar region pairs
[Figure: a query image alongside the images retrieved by WALRUS]
Problems with Web Search Today
Today's search engines are plagued by problems:
– the abundance problem (99% of info of no interest to 99% of people)
– limited coverage of the Web (internet sources hidden behind search interfaces); the largest crawlers cover < 18% of all web pages
– limited query interfaces, based on keyword-oriented search
– limited customization to individual users
– the Web is highly dynamic: lots of pages are added, removed, and updated every day
– very high dimensionality
Improve Search By Adding Structure to the Web
Use Web directories (or topic hierarchies)
– Provide a hierarchical classification of documents (e.g., Yahoo!)
Searches performed in the context of a topic restrict the search to only the subset of web pages related to that topic
[Figure: the Yahoo home page hierarchy: Recreation (Travel, Sports), Science, Business (Companies, Finance, Jobs), News]
Automatic Creation of Web Directories
In the Clever project, hyper-links between Web pages are taken into account when categorizing them
– Use a Bayesian classifier
– Exploit knowledge of the classes of the immediate neighbors of the document to be classified
– Show that simply taking text from the neighbors and using standard document classifiers to classify the page does not work
Inktomi's Directory Engine uses "Concept Induction" to automatically categorize millions of documents
Network Management
Objective: to deliver content to users quickly and reliably
– Traffic management
– Fault management
[Figure: a service provider network of routers and servers]
Why is Traffic Management Important?
While annual bandwidth demand is increasing ten-fold on average, annual bandwidth supply is rising only by a factor of three
The result is frequent congestion at servers and on network links
– During a major event (e.g., Princess Diana's death), an overwhelming number of user requests can result in millions of redundant copies of data flowing back and forth across the world
– Olympic sites during the games
– NASA sites close to launches and landings of shuttles
Traffic Management
Key ideas
– Dynamically replicate/cache content at multiple sites within the network, closer to the user
– Maintain multiple paths between any pair of sites
– Route user requests to the server closest to the user, or to the least loaded server
– Use the path with the least congested network links
Akamai and Inktomi offer such content delivery services
Traffic Management
[Figure: a request routed around a congested server and a congested link in the service provider network]
Traffic Management
Need to mine network and Web traffic to determine
– What content to replicate?
– Which servers should store replicas?
– Which path to use to route packets?
– Which server to route a user request to?
Network design issues
– Where to place servers?
– Where to place routers?
– Which routers should be connected by links?
One can use association rule and sequential pattern mining algorithms to cache/prefetch replicas at servers
Fault Management
Fault management involves
– Quickly identifying failed/congested servers and links in the network
– Re-routing user requests and packets to avoid congested/down servers and links
Need to analyze alarm and traffic data to carry out root cause analysis of faults
Bayesian classifiers can be used to predict the root cause given a set of alarms
Web Mining Issues
Size
– Grows at about 1 million pages a day
– Google indexes 9 billion documents
– Number of web sites: the Netcraft survey says 72 million sites (http://news.netcraft.com/archives/web_server_survey.html)
Diverse types of data
– Images
– Text
– Audio/video
– XML
– HTML
Number of Active Sites
[Figure: Netcraft survey, total sites across all domains, August 1995 to October 2007]
Systems Issues
– Web data sets can be very large: tens to hundreds of terabytes
– Cannot mine on a single server! Need large farms of servers
– How to organize hardware/software to mine multi-terabyte data sets, without breaking the bank!
Different Data Formats
Structured data
Unstructured data
OLE DB offers some solutions!
Web Data
Web pages
Intra-page structures
Inter-page structures
Usage data logs
Supplemental data
– Profiles
– Registration information
– Cookies
Web Mining
Web Mining
– is the application of data mining techniques to discover patterns from the Web
Data Mining
– also called Knowledge Discovery in Databases (KDD), is the process of automatically searching large volumes of data for patterns (extracting knowledge from data)
Web mining comprises three areas:
– Web Content Mining (WCM)
– Web Structure Mining (WSM)
– Web Usage Mining (WUM)
Web Mining
Web Structure Mining
– uses graph theory to analyse the node and connection structure of a web site
– retrieves pages that are not only relevant, but also of high quality, or authoritative on the topic
Web Content Mining
– discovers useful information from the content of a web page; the web content may consist of text, image, audio or video data
Web Usage Mining
– analyses and discovers interesting patterns in users' usage data on the web; the usage data records the users' behaviour as they browse or make transactions on the web site
Web Structure Mining
Mine the structure (links, graph) of the Web
Techniques
– PageRank
– CLEVER
Create a model of the Web's organization
May be combined with content mining to more effectively retrieve important pages
PageRank
Used by Google
Prioritizes pages returned from search by looking at Web structure
The importance of a page is calculated based on the number of pages which point to it (its backlinks)
Weighting is used to give more importance to backlinks coming from important pages
Ranking web pages
Web pages are not equally "important"
– www.joe-schmoe.com vs. www.stanford.edu
Inlinks as votes
– www.stanford.edu has 23,400 inlinks
– www.joe-schmoe.com has 1 inlink
Are all inlinks equal?
Simple “flow” model
The web in 1839
[Figure: three pages Yahoo (y), Amazon (a), and M'soft (m); Yahoo links to itself and Amazon, Amazon links to Yahoo and M'soft, M'soft links to Amazon; each page's rank flows out in equal shares y/2, y/2, a/2, a/2, m]

y = y/2 + a/2
a = y/2 + m
m = a/2
Matrix formulation
Matrix M has one row and one column for each web page
Suppose page j has n outlinks
– If j → i, then Mij = 1/n
– Else Mij = 0
M is a column stochastic matrix
– Columns sum to 1
Suppose r is a vector with one entry per web page
– ri is the importance score of page i
– Call it the rank vector
– |r| = 1
(A small builder sketch follows.)
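The definition translates directly into code. A minimal Python sketch, using the three-page Yahoo/Amazon/M'soft link structure from the surrounding slides:

import numpy as np

pages = ["y", "a", "m"]
links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}  # page j -> its outlinks

idx = {p: i for i, p in enumerate(pages)}
M = np.zeros((len(pages), len(pages)))
for j, outs in links.items():
    for i in outs:
        M[idx[i], idx[j]] = 1.0 / len(outs)  # Mij = 1/n if j -> i, else 0

print(M)              # [[0.5 0.5 0. ] [0.5 0.  1. ] [0.  0.5 0. ]]
print(M.sum(axis=0))  # [1. 1. 1.]: column stochastic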
Example of Constructing M
Suppose page j links to 3 pages, including i. Then column j of M has the entry 1/3 in row i (and in the rows of the other two pages), so in M r = r, page i receives rj/3 from page j.
[Figure: the matrix equation M r = r with column j and row i highlighted, showing the entry 1/3]
Eigenvector formulation
The flow equations can be written r = M r
So the rank vector r is an eigenvector of the stochastic web matrix M
– In fact, it is the first or principal eigenvector, with corresponding eigenvalue 1
Example
[Figure: the three-page graph of Yahoo, Amazon, and M'soft]

     y    a    m
y   1/2  1/2   0
a   1/2   0    1
m    0   1/2   0

y = y/2 + a/2
a = y/2 + m
m = a/2

r = M r:

y     1/2 1/2  0    y
a  =  1/2  0   1    a
m      0  1/2  0    m
Power Iteration method
Simple iterative scheme (aka relaxation)
Suppose there are N web pages
Initialize: r0 = [1/N, …, 1/N]T
Iterate: rk+1 = M rk
Stop when |rk+1 - rk|1 < ε
– |x|1 = Σ1≤i≤N |xi| is the L1 norm
– Can use any other vector norm, e.g., Euclidean
(Code for this loop follows.)
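A direct transcription of the method in Python, using the matrix M built above; the tolerance 1e-9 is an arbitrary choice.

import numpy as np

M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 1.0],
              [0.0, 0.5, 0.0]])
N = M.shape[0]

r = np.full(N, 1.0 / N)      # r0 = [1/N, ..., 1/N]T
while True:
    r_next = M @ r           # rk+1 = M rk
    if np.abs(r_next - r).sum() < 1e-9:  # L1 norm of the change
        break
    r = r_next
print(r)  # [0.4 0.4 0.2], i.e. (2/5, 2/5, 1/5)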
Power Iteration Example
[Figure: the three-page graph of Yahoo, Amazon, and M'soft]

     y    a    m
y   1/2  1/2   0
a   1/2   0    1
m    0   1/2   0

y     1/3     1/3     5/12     3/8             2/5
a  =  1/3  →  1/2  →  1/3   →  11/24  →  …  →  2/5
m     1/3     1/6     1/4      1/6             1/5
Problems with the PageRank Algorithm

[Figure: the three-page graph, but M'soft now links only to itself]

     y    a    m
y   1/2  1/2   0
a   1/2   0    0
m    0   1/2   1

y     1     1       3/4     5/8           0
a  =  1  →  1/2  →  1/2  →  3/8  →  …  →  0
m     1     3/2     7/4      2            3

Microsoft becomes a spider trap
PageRank Calculation with Damping Factor
The PageRank of a page A is given as follows:

PR(A) = (1 - d) + d · (PR(T1)/C(T1) + … + PR(Tn)/C(Tn))

– d is the damping factor (usually set to 0.85)
– T1, …, Tn: the pages that link to page A
– PR(Ti): the PageRank of page Ti
– C(Ti): the number of outgoing links of page Ti
Example of Calculation (1/5)
[Figure: four pages A, B, C, D with links A→B, A→C, B→C, C→A, and D→C]
Given d = 0.85
Example of Calculation (2/5)
All pages start with rank 1: r0 = (1, 1, 1, 1)T for (A, B, C, D)
[Figure: Page A passes 0.85·(1/2) to each of Pages B and C; Pages B, C, and D each pass 0.85·(1/1) along their single outlink]
Example of Calculation (3/5)
Page A: 0.15 (not transferred) + 0.85 (from Page C) = 1
Page B: 0.15 (not transferred) + 0.85·(1/2) (from Page A) = 0.575
Page C: 0.85 (from Page D) + 0.85 (from Page B) + 0.425 (from Page A) + 0.15 (not transferred) = 2.275
Page D: receives none, but keeps the non-transferred 0.15 = 0.15

r1 = (1, 0.575, 2.275, 0.15)T
Example of Calculation (4/5)
Page A: 2.275·0.85 (from Page C) + 0.15 (not transferred) = 2.08375
Page B: 1·0.85/2 (from Page A) + 0.15 (not transferred) = 0.575
Page C: 0.15·0.85 (from Page D) + 0.575·0.85 (from Page B) + 1·0.85/2 (from Page A) + 0.15 (not transferred) = 1.19125
Page D: receives none, but keeps the non-transferred 0.15 = 0.15

r2 = (2.08375, 0.575, 1.19125, 0.15)T
Example of Calculation (5/5)
After 20 iterations, we get
Page A: 1.490, Page B: 0.783, Page C: 1.577, Page D: 0.15

r20 ≈ (1.490, 0.783, 1.577, 0.15)T (a script verifying this follows)
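The hand calculation can be checked with a short Python script; the link structure A→B, A→C, B→C, C→A, D→C is read off from the example figure.

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
pages = ["A", "B", "C", "D"]
d = 0.85

pr = {p: 1.0 for p in pages}           # r0 = (1, 1, 1, 1)
for _ in range(20):
    new = {p: 1.0 - d for p in pages}  # the 0.15 that is not transferred
    for j, outs in links.items():
        for i in outs:
            new[i] += d * pr[j] / len(outs)  # d * PR(Tj) / C(Tj)
    pr = new
print(pr)  # {'A': 1.490..., 'B': 0.783..., 'C': 1.577..., 'D': 0.15}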
PageRank Calculation with Random Teleports

The Google solution for spider traps
At each time step, the random surfer has two options:
– With probability β, follow a link at random
– With probability 1 - β, jump to some page uniformly at random
The surfer will teleport out of a spider trap within a few time steps
β is commonly in the range 0.8 to 0.9
PageRank with Random Teleports

Construct the N×N matrix A as follows
– Aij = β Mij + (1 - β)/N
Verify that A is a stochastic matrix
The PageRank vector r is the principal eigenvector of this matrix
– satisfying r = A r
Equivalently, r is the stationary distribution of the random walk with teleports (a quick numerical check follows)
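A quick numerical check of this construction on the spider-trap example, with β = 0.8 as in the next slides:

import numpy as np

beta = 0.8
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.0],
              [0.0, 0.5, 1.0]])   # m is the spider trap
N = M.shape[0]

A = beta * M + (1 - beta) / N     # Aij = beta*Mij + (1-beta)/N
print(A)                          # the 7/15, 1/15, 13/15 entries of the next slide
print(A.sum(axis=0))              # [1. 1. 1.]: A is column stochastic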
Random Teleports (β = 0.8)
[Figure: the three-page graph with solid link edges of weight 0.8·(1/2) and dashed teleport edges of weight 0.2·(1/3)]
Random Teleports (β = 0.8)
[Figure: the same graph, showing link edges weighted 0.8·Mij and teleport edges weighted 0.2·(1/3)]

        1/2 1/2  0         1/3 1/3 1/3       7/15 7/15  1/15
A = 0.8 1/2  0   0  + 0.2  1/3 1/3 1/3   =   7/15 1/15  1/15
         0  1/2  1         1/3 1/3 1/3       1/15 7/15 13/15
Random Teleports (β = 0.8)

    7/15 7/15  1/15
A = 7/15 1/15  1/15
    1/15 7/15 13/15

rk+1 = A rk

y     1     1.00     0.84     0.776             7/11
a  =  1  →  0.60  →  0.60  →  0.536  →  …  →    5/11
m     1     1.40     1.56     1.688            21/11
PageRank for a Large Network

The key step is matrix-vector multiplication
– rnew = A rold
Easy if we have enough main memory to hold A, rold, rnew
However, say N = 1 billion pages
– We need 4 bytes for each entry (say)
– 2 billion entries for the two vectors, approx 8 GB
– Matrix A has N^2 entries; 10^18 is a large number!
Solutions
– Block-Stripe Update algorithm
(A sparse-matrix sketch of the underlying idea follows.)
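One way to see the saving (a sketch of the sparseness idea, not the Block-Stripe algorithm itself): keep only the sparse link matrix M and fold the teleport term in on the fly, so the dense N×N matrix A is never materialized. The 3-page example stands in for a web-scale M that would have only a handful of nonzeros per column.

import numpy as np
from scipy.sparse import csr_matrix

beta = 0.8
M = csr_matrix(np.array([[0.5, 0.5, 0.0],    # sparse column-stochastic link matrix
                         [0.5, 0.0, 0.0],
                         [0.0, 0.5, 1.0]]))
N = M.shape[0]

r = np.full(N, 1.0)
for _ in range(60):
    # rnew = A rold = beta*(M rold) + (1-beta)/N * sum(rold),
    # computed without ever forming the dense matrix A.
    r = beta * (M @ r) + (1 - beta) / N * r.sum()
print(r)  # ~[0.636 0.455 1.909], i.e. (7/11, 5/11, 21/11)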
CLEVER (HITS Model)
Interesting documents fall into two classes
Authoritative pages:
– Highly important pages
– The best sources for the requested information
– Ex. course home pages
Hub pages:
– Contain links to highly important pages
– Ex. course bulletins
Proposed by IBM
Mutually recursive definition
A good hub links to many good authorities
A good authority is linked from many good hubs
Model using two scores for each node
– Hub score and authority score
– Represented as vectors h and a
[Figure: idealized view of hubs pointing to authorities]
HITS
Hyperlink-Induced Topic Search
Based on a set of keywords, find the set of relevant pages, R
Identify hub and authority pages for these
– Expand R to a base set, B, of pages linked to or from R
– Calculate weights for authorities and hubs
Pages with the highest ranks in R are returned
Transition Matrix A
HITS uses a matrix A
– A[i, j] = 1 if page i links to page j, 0 if not
AT, the transpose of A
– is similar to the PageRank matrix M, but AT has 1's where M has fractions
Hub and Authority Equations
The hub score of page P is proportional to the sum of the authority scores of the pages it links to
– h = λ A a
– The constant λ is a scale factor
The authority score of page P is proportional to the sum of the hub scores of the pages it is linked from
– a = μ AT h
– The constant μ is a scale factor
Iterative algorithm
1. Initialize h, a to all 1's
2. h = A a
3. Scale h so that its max entry is 1.0
4. a = AT h
5. Scale a so that its max entry is 1.0
6. Continue until h, a converge
(An executable version of these steps follows.)
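The six steps in Python, run on the example matrix from the next slide; the fixed iteration count stands in for a convergence test.

import numpy as np

# A[i, j] = 1 if page i links to page j (order: yahoo, amazon, m'soft)
A = np.array([[1, 1, 1],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)

h = np.ones(3)     # step 1: initialize h, a to all 1's
a = np.ones(3)
for _ in range(100):
    h = A @ a      # step 2: h = A a
    h /= h.max()   # step 3: scale so max entry is 1.0
    a = A.T @ h    # step 4: a = AT h
    a /= a.max()   # step 5: scale so max entry is 1.0
print(h)  # [1.    0.732 0.268]
print(a)  # [1.    0.732 1.   ]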
Example
[Figure: the three-page graph of Yahoo, Amazon, and M'soft]

        y a m
    y   1 1 1
A = a   1 0 1
    m   0 1 0
Example
    1 1 1        1 1 0
A = 1 0 1   AT = 1 0 1
    0 1 0        1 1 0

h(yahoo)  = 1    1       1       1             1.000
h(amazon) = 1 →  2/3  →  0.71 →  0.73 →  …  →  0.732
h(m'soft) = 1    1/3     0.29    0.27          0.268

a(yahoo)  = 1    1       1             1.000
a(amazon) = 1 →  4/5  →  0.75 →  …  →  0.732
a(m'soft) = 1    1       1             1.000

After each step, scale so that the max entry is 1.0
PageRank vs. Authorities

PageRank (Google)
– computed for all web pages stored in the database prior to the query
– computes authorities only
– trivial and fast to compute
HITS (CLEVER, IBM)
– performed on the set of retrieved web pages for each query
– computes authorities and hubs
– easy to compute, but real-time execution is hard
PageRank Algorithm (Damping)

initialize ranks R0
while (not converged)
    for each vertex i
        Rk+1(i) = (1 - d) + d · Σj∈Bi Rk(j) / Nj
    end
end

(Bi: the set of pages linking to i; Nj: the number of outgoing links of page j)
HITS Algorithm

initialize authority and hub weights, x0 and y0
while (not converged)
    for each vertex i
        yk+1(i) = Σj∈F(i) xk(j)    (hub weight of i: sum of authority weights of the pages i links to)
        xk+1(i) = Σj∈B(i) yk(j)    (authority weight of i: sum of hub weights of the pages linking to i)
    end
end