
Web Mining and Link Analysis

Programming Collective Intelligence – Segaran

Padhraic Smyth notes

KDNuggets course notes

Data Mining - Volinsky - Fall 2011 - Columbia University


Web Mining v. Data Mining

Web Mining is: Discovering useful information from the World-Wide Web and its usage patterns

• Structure (or lack of it)
  – Textual information and linkage structure – unstructured data
• Scale
  – Data generated per day is comparable to the largest conventional data warehouses
• Speed
  – Often need to react to evolving usage patterns in real time (e.g., merchandising, web security)


What is “the Web”?

• The WWW is a huge, widely distributed, global information service center for
  – Information services: news, advertisements, consumer information, financial management, education, government, e-commerce, etc.
• Hyperlink structure is what makes it so useful
• Provides rich sources for data mining
• Essentially infinite size (>20B pages)
  – With lots of duplication


Why Web Mining?

• Useful to study human digital behavior, e.g. search engine data can be used for
  – Exploration, e.g. # of queries per session?
  – Modeling, e.g. any time-of-day dependence?
  – Prediction, e.g. which pages are relevant?
• Applications
  – Understand social implications of Web usage
  – Design of better tools for information access
  – E-commerce applications
  – Advertising is a key driver of online business


Advertising Applications

• Revenue of many internet companies is driven by advertising
• Key problem:
  – Given user data:
    • Pages browsed
    • Keywords used in search
    • Demographics
  – Determine the most relevant ads (in real time)
  – Includes bidding/pricing of ads
• Another major problem: “click fraud”
  – AdSense – place Google ads on your web site
  – AdWords – buy “keywords” to put on Google search
  – Determine fraudulent usage through data mining
• Understanding the user is key to these types of applications


Data Sources for Web Mining

• Web content
  – Text and HTML content on Web pages
  – User-generated content: blogs, microblogs (Twitter), social networks
• Web connectivity
  – Hyperlink/directed-graph structure of the Web
• Web user data
  – Data on how users interact with the Web
    • Navigation data, aka “clickstream” data
    • Search query data (keywords for users)
    • Online transaction data
• Who has this data?


Accessing data for web mining

• Scripting languages like Perl or Python make web scraping easy.
• “User-generated” content is meant to be consumed!
• Many websites have APIs for access to data
  – If there is an API, please follow it!
  – Can be open: Wikipedia, IMDb
  – Can be restricted: Facebook, eBay, Amazon

• If you are interested, a good book is


Examples of Web Mining Viz

• Volume of data along with useful APIs makes a lot of data available for visualization and analysis.

• Twitter happiness metric
• Blogpulse.com


Analyzing User navigation

• Web logs
  – Record activity between client browser and a specific Web server
  – Easily available
  – Collected by server, ISP
• Search engine records
  – Text in queries, which pages were viewed, which snippets were clicked on, etc.
• Client-side browsing records
  – Automatically recorded by client-side software
  – Harder to obtain, but much more accurate than server-side logs
• Other sources
  – Cookies: collected on client/browser, readable by server
  – Web site registration, purchases, email, etc.
  – ISP recording of Web browsing


Example of Web Log entries

Apache web log:

207.237.112.68 - - [25/Oct/2009:06:13:30 -0400] "GET /~volinsky/DataMining/Columbia.html HTTP/1.1" 304 - "-" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; GTB6; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.1; OfficeLiveConnector.1.3; OfficeLivePatch.0.0; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)"

207.237.112.68 - - [25/Oct/2009:06:13:35 -0400] "GET /~volinsky/DataMining/HW/HW4.html HTTP/1.1" 304 - "http://www.research.att.com/~volinsky/DataMining/Columbia.html" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; GTB6; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.1; OfficeLiveConnector.1.3; OfficeLivePatch.0.0; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)"

69.86.67.231 - - [25/Oct/2009:10:10:36 -0400] "GET /~volinsky/DataMining/Columbia.html HTTP/1.1" 304 - "-" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.10) Gecko/2009042700 SUSE/3.0.10-1.1.1 Firefox/3.0.10"

66.234.60.140 - - [25/Oct/2009:10:21:29 -0400] "GET /~volinsky/DataMining/Columbia.html HTTP/1.1" 200 11527 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.14) Gecko/2009082707 Firefox/3.0.14 (.NET CLR 3.5.30729)"

66.234.60.140 - - [25/Oct/2009:10:21:33 -0400] "GET /~volinsky/DataMining/HW/HW4.html HTTP/1.1" 200 4584 "http://www.research.att.com/~volinsky/DataMining/Columbia.html" "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.14) Gecko/2009082707 Firefox/3.0.14 (.NET CLR 3.5.30729)"

68.239.18.39 - - [25/Oct/2009:11:03:11 -0400] "GET /~volinsky/DataMining/Columbia.html HTTP/1.1" 304 - "-" "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_8; en-us) AppleWebKit/531.9 (KHTML, like Gecko) Version/4.0.3 Safari/531.9"

68.239.18.39 - - [25/Oct/2009:12:09:47 -0400] "GET /~volinsky/DataMining/Columbia.html HTTP/1.1" 304 - "-" "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_5_8; en-us) AppleWebKit/531.9 (KHTML, like Gecko) Version/4.0.3 Safari/531.9"

66.65.114.97 - - [25/Oct/2009:12:17:15 -0400] "GET /~volinsky/DataMining/Columbia.html HTTP/1.1" 200 11527 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.3) Gecko/20090824 Firefox/3.5.3"

128.59.154.126 - - [25/Oct/2009:13:14:18 -0400] "GET /~volinsky/DataMining/Columbia.html HTTP/1.1" 200 11527 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.3) Gecko/20090824 Firefox/3.5.3 (.NET CLR 3.5.30729)"

128.59.154.126 - - [25/Oct/2009:13:14:21 -0400] "GET /~volinsky/DataMining/HW/HW4.html HTTP/1.1" 200 4584 "http://www.research.att.com/~volinsky/DataMining/Columbia.html" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.3) Gecko/20090824 Firefox/3.5.3 (.NET CLR 3.5.30729

66.249.65.210 - - [01/Oct/2009:05:52:03 -0400] "GET /~volinsky/myrefs.html HTTP/1.1" 200 21698 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
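The entries above appear to follow Apache's Combined Log Format (host, identity, user, timestamp, request line, status, bytes, referrer, user agent). A minimal parsing sketch under that assumption is shown below; the names `LOG_PATTERN` and `parse_log_line` are illustrative and not part of the original slides.

```python
import re
from datetime import datetime

# Assumed layout (Combined Log Format):
# host ident user [time] "request" status bytes "referrer" "user-agent"
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_log_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    # A size of "-" means no body was sent (e.g. a 304 Not Modified response).
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    # The request field looks like: GET /path HTTP/1.1
    parts = entry["request"].split()
    entry["path"] = parts[1] if len(parts) >= 2 else None
    return entry


sample = ('66.234.60.140 - - [25/Oct/2009:10:21:29 -0400] '
          '"GET /~volinsky/DataMining/Columbia.html HTTP/1.1" 200 11527 "-" '
          '"Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.14) '
          'Gecko/2009082707 Firefox/3.0.14 (.NET CLR 3.5.30729)"')
entry = parse_log_line(sample)
print(entry["host"], entry["path"], entry["status"], entry["size"])
```

Lines that do not match the pattern (for example, truncated entries) return `None`, so a caller can count and skip them rather than crash.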


Routine Server Log Analysis

• Typical statistics/histograms that are computed
  – Most and least visited web pages
  – Entry and exit pages
  – Referrals from other sites or search engines
  – What are the searched keywords
  – How many clicks/page views a page received
  – Error reports, like broken links
• Many software products produce standard reports of this type of data
  – e.g., are there clusters/groups of users that use the site in different ways?
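A few of the routine statistics listed above can be sketched with counters. This is only an illustration: the record layout (dicts with `path`, `status`, and `referrer` keys) and the sample records are assumptions, not data from the slides.

```python
from collections import Counter


def summarize(entries):
    """Count page views, external referrers, and error responses per page.

    entries: iterable of dicts with 'path', 'status', and 'referrer' keys
    (a hypothetical record layout chosen for this sketch).
    """
    page_views = Counter()
    referrers = Counter()
    errors = Counter()
    for e in entries:
        page_views[e["path"]] += 1
        if e["referrer"] != "-":          # "-" means no referrer was sent
            referrers[e["referrer"]] += 1
        if e["status"] >= 400:            # 4xx/5xx, e.g. broken links
            errors[e["path"]] += 1
    return page_views, referrers, errors


# Hypothetical records for illustration only.
records = [
    {"path": "/index.html", "status": 200, "referrer": "-"},
    {"path": "/hw4.html", "status": 200, "referrer": "http://example.org/index.html"},
    {"path": "/index.html", "status": 304, "referrer": "-"},
    {"path": "/missing.html", "status": 404, "referrer": "http://example.org/index.html"},
]
page_views, referrers, errors = summarize(records)
print(page_views.most_common(1))   # most visited page
print(errors)                      # pages that produced error responses
```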


Descriptive Summary Statistics


Web data measurement issues

• Important to understand how data is collected

• Web data is collected automatically via software logging tools
  – Advantage:
    • No manual supervision required
  – Disadvantage:
    • Data can be skewed (e.g. due to the presence of robot traffic)

• Important to identify robots (also known as crawlers, spiders)


Robot / human identification

• Removal of robot data is an important preprocessing step before any clickstream analysis
• Robots come in all shapes and sizes
  – Good: Google wants to map the net to provide good search
  – Bad: Competitor is scraping your web site to see what you are up to (or steal data)
• Robot page requests are often identified using a variety of heuristics
  – e.g. some robots identify themselves in the server logs
    • Robots.txt
    • Also, robots should identify themselves via the User Agent field in page requests
  – Patterns of access
• How would you detect robots? How would you escape detection?
• Tan and Kumar (Journal of Data Mining and Knowledge Discovery, 2002) provide a detailed description of using classification techniques to learn how to detect robots
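Two of the heuristics mentioned above (self-identification in the User Agent field, and requests for robots.txt) can be sketched as follows. The token list and the function name `looks_like_robot` are assumptions made for illustration; this is a simple rule of thumb, not the classifier described by Tan and Kumar.

```python
# Illustrative, non-exhaustive list of substrings that self-identifying
# robots often include in their User-Agent string (an assumption).
ROBOT_TOKENS = ("bot", "crawler", "spider")


def looks_like_robot(user_agent, requested_paths):
    """Flag a visitor as a likely robot using two simple heuristics."""
    agent = user_agent.lower()
    # Heuristic 1: the client identifies itself in the User-Agent field.
    if any(token in agent for token in ROBOT_TOKENS):
        return True
    # Heuristic 2: the client fetched robots.txt, which browsers rarely do.
    return "/robots.txt" in requested_paths


googlebot = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
firefox = "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.3) Gecko/20090824 Firefox/3.5.3"

print(looks_like_robot(googlebot, ["/~volinsky/myrefs.html"]))
print(looks_like_robot(firefox, ["/~volinsky/DataMining/Columbia.html"]))
```

A robot that wants to escape detection can spoof a browser User-Agent and skip robots.txt, which is why access patterns matter as well.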


A time-series plot of UCI Website data

Number of page requests per hour as a function of time from page requests in the www.ics.uci.edu Web server logs during the first week of April 2002.


From Tan and Kumar, 2002

Overall accuracies of around 90% were obtained using decision tree classifiers.

Like spam, identifying bots is a constant arms race


Sessionizing

• Aggregating clicks into sessions can be useful
  – e.g., what did you do when you sat at the computer?
• How might you determine this?
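One common way to answer the question above is an inactivity timeout: a new session starts whenever the gap between consecutive clicks from the same visitor exceeds a threshold. The sketch below assumes a 30-minute threshold and identifies visitors by IP address; both choices are assumptions for illustration, not something the slides prescribe.

```python
from datetime import datetime, timedelta


def sessionize(clicks, timeout=timedelta(minutes=30)):
    """Group clicks into sessions using an inactivity timeout.

    clicks: list of (user, timestamp, page) tuples.
    Returns a list of sessions, each a list of clicks by one user.
    """
    sessions = []
    prev_user, prev_time = None, None
    for user, ts, page in sorted(clicks, key=lambda c: (c[0], c[1])):
        # Start a new session on a new user or after a long gap.
        if user != prev_user or ts - prev_time > timeout:
            sessions.append([])
        sessions[-1].append((user, ts, page))
        prev_user, prev_time = user, ts
    return sessions


# Clicks taken from the sample web log, with the IP address standing in for the user.
clicks = [
    ("66.234.60.140", datetime(2009, 10, 25, 10, 21, 29), "Columbia.html"),
    ("66.234.60.140", datetime(2009, 10, 25, 10, 21, 33), "HW/HW4.html"),
    ("68.239.18.39", datetime(2009, 10, 25, 11, 3, 11), "Columbia.html"),
    ("68.239.18.39", datetime(2009, 10, 25, 12, 9, 47), "Columbia.html"),
]
sessions = sessionize(clicks)
print(len(sessions))
```

Here the two requests from 66.234.60.140 fall four seconds apart and share a session, while the two requests from 68.239.18.39 are more than an hour apart and are split.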


Client-side data

• Advantages of collecting data at the client side:
  – Direct recording of page requests (eliminates ‘masking’ due to caching)
  – Recording of all browser-related actions by a user (including visits to multiple websites)
  – More reliable identification of individual users (e.g. by login ID for multiple users on a single computer)
• Preferred mode of data collection for studies of navigation behavior on the Web
• Companies like ComScore and Nielsen use client-side software to track home computer users
  – but with what biases?


comScore Report 2008

• 185 million U.S. people age 2+ online in a month, spending an average of 29 hours online per person*

• 80% of 824 million global Internet users now outside of U.S.

• 99% of online population search in a month, conducting 22 searches per searcher**

• 75% of online population stream a video, viewing an average of 70 videos per viewer per month***
  – Up 36% vs. a year ago

• 66% of online population visit a social networking site, spending 4 hours per month per visitor*

• 40% of online population visit a blog site in a month*

* February 2008, U.S., comScore Media Metrix
** February 2008, U.S., comScore qSearch 2.0
*** January 2008, U.S., comScore Video Metrix


Modeling Clickrate Data

• Data
  – Goal is to build a time-series model that characterizes user click rates
  – Usually: cluster data into user types


Markov models for page prediction

• Why would we want to predict where a user is surfing?
  – Pre-cached web pages save time
• General approach is to use a finite-state Markov chain
  – Each state can be a specific Web page or a category of Web pages
  – If only interested in the order of visits (and not in time), each new request can be modeled as a transition of states
• For simplicity, consider an order-dependent, time-independent finite-state Markov chain with M states


Markov models for page prediction

• Let s be a sequence of observed states of length L, e.g. s = ABBCAABBCCBBAA with three states A, B and C. $s_t$ is the state at position t ($1 \le t \le L$). In general,

$$P(s) = P(s_1) \prod_{t=2}^{L} P(s_t \mid s_{t-1}, \ldots, s_1)$$

• First-order Markov assumption:

$$P(s_t \mid s_{t-1}, \ldots, s_1) = P(s_t \mid s_{t-1})$$

• This provides a simple generative model:

$$P(s) = P(s_1) \prod_{t=2}^{L} P(s_t \mid s_{t-1})$$
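Under the first-order assumption, the probability of a whole sequence is the initial-state probability multiplied by one transition probability per step. A small sketch of that computation follows, working in log space to avoid underflow on long sequences; the initial distribution and transition probabilities are made-up numbers for illustration.

```python
import math


def sequence_log_prob(seq, initial, transition):
    """log P(s) = log P(s_1) + sum over t >= 2 of log P(s_t | s_{t-1})."""
    log_p = math.log(initial[seq[0]])
    for prev, cur in zip(seq, seq[1:]):
        log_p += math.log(transition[prev][cur])
    return log_p


# Hypothetical parameters for three states A, B, C (each row sums to 1).
initial = {"A": 0.5, "B": 0.3, "C": 0.2}
transition = {
    "A": {"A": 0.5, "B": 0.4, "C": 0.1},
    "B": {"A": 0.2, "B": 0.5, "C": 0.3},
    "C": {"A": 0.3, "B": 0.3, "C": 0.4},
}

# P(ABBC) = P(A) * P(B|A) * P(B|B) * P(C|B) = 0.5 * 0.4 * 0.5 * 0.3
p = math.exp(sequence_log_prob("ABBC", initial, transition))
print(round(p, 6))
```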


Markov models for page prediction

• If we denote $T_{ij} = P(s_t = j \mid s_{t-1} = i)$, we can define a P × P transition matrix
• Each page is a “state”: P can be of the order $10^5$ to $10^6$
• If P is large, we might cluster the P pages into M clusters, which now become the states in the Markov model


Markov models for page prediction

• $T_{ij} = P(s_t = j \mid s_{t-1} = i)$ represents the probability that an individual user’s next request will be from state j, given they were in state i
• We can add E, an end state, to the model
• E.g. for three categories with end state:

$$T = \begin{pmatrix}
P(1|1) & P(2|1) & P(3|1) & P(E|1) \\
P(1|2) & P(2|2) & P(3|2) & P(E|2) \\
P(1|3) & P(2|3) & P(3|3) & P(E|3) \\
0 & 0 & 0 & 1
\end{pmatrix}$$

• Rows sum to 1
• E denotes the end of a sequence, and start of a new sequence


Markov models for page prediction

• First-order Markov model assumes that the next state is based only on the current state

• This is a strong assumption!
  – Doesn’t consider ‘long-term memory’
• We can try to capture more memory with a kth-order Markov chain (increased complexity):

$$P(s_t \mid s_{t-1}, \ldots, s_1) = P(s_t \mid s_{t-1}, \ldots, s_{t-k})$$


Transition probability estimates for Markov model

• Maximum likelihood estimate:

$$T_{ij} = \frac{n_{ij}}{n_i}$$

• where $n_{ij}$ is the number of cases that go from state i to state j, and $n_i$ is the number of cases starting in state i.
• Smoothed parameter estimates:

$$T_{ij} = \frac{n_{ij} + \alpha q_{ij}}{n_i + \alpha}$$

• $q_{ij}$ is a prior transition matrix
• If $n_{ij} = 0$ for some transition (i, j), the smoothed version allows prior knowledge to be incorporated, instead of having a parameter estimate of 0.
• If $n_{ij} > 0$, we get a smooth combination of the data-driven information ($n_{ij}$) and the prior. $\alpha$ determines how much the prior ($q_{ij}$) matters
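The two estimates on this slide (raw counts n_ij / n_i, and the smoothed version that mixes in a prior q_ij) can be sketched in a few lines. The weight on the prior is written `alpha` here, and a uniform prior is used when none is supplied; both the symbol name and the uniform default are assumptions for this sketch.

```python
def estimate_transitions(sequences, states, prior=None, alpha=0.0):
    """Estimate T[i][j] = (n_ij + alpha * q_ij) / (n_i + alpha).

    With alpha = 0 this reduces to the plain estimate n_ij / n_i.
    prior: optional dict of dicts q[i][j]; defaults to a uniform prior.
    """
    counts = {i: {j: 0 for j in states} for i in states}
    for seq in sequences:
        for prev, cur in zip(seq, seq[1:]):
            counts[prev][cur] += 1          # n_ij

    T = {}
    for i in states:
        n_i = sum(counts[i].values())       # transitions starting in state i
        T[i] = {}
        for j in states:
            q_ij = prior[i][j] if prior is not None else 1.0 / len(states)
            denom = n_i + alpha
            # If state i was never left and alpha is 0, fall back to the prior.
            T[i][j] = (counts[i][j] + alpha * q_ij) / denom if denom > 0 else q_ij
    return T


states = ["A", "B", "C"]
data = ["ABBCAABBCCBBAA"]                   # the example sequence from the slides
mle = estimate_transitions(data, states)
smoothed = estimate_transitions(data, states, alpha=1.0)
print(mle["A"]["C"], round(smoothed["A"]["C"], 4))
```

In the example sequence the transition A to C never occurs, so the unsmoothed estimate is exactly 0, while the smoothed estimate assigns it a small positive probability.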


Ranking Web Pages


Ranking web pages

• Web pages are not equally “important”
• How do you determine the “importance” of a web page?
• Big Idea: Inlinks are a measure of importance.
  – Virtualstapler.com = 178
  – Nytimes.com = 13,000
• Are all inlinks equal?
  – They are important if linked to by many important sites
  – Recursive question!


Simple recursive formulation

• Each link’s vote is proportional to the importance of its source page
  – if important pages link to me, my links count more

• If page P with importance x has n outlinks, each link gets x/n votes


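Simple “flow” model (courtesy Rajaraman, Ullman)

[Figure: three-page web graph. Yahoo (y) links to itself and to Amazon, each link carrying y/2. Amazon (a) links to Yahoo and to M’soft, each link carrying a/2. M’soft (m) links to Amazon, carrying m.]

y = y/2 + a/2
a = y/2 + m
m = a/2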


Solving the flow equations

• 3 equations, 3 unknowns
  – Constraint: y + a + m = 1
  – Solution: y = 2/5, a = 2/5, m = 1/5

• Nice for a small example, but need something more general
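As a check on the quoted solution, the flow equations y = y/2 + a/2, a = y/2 + m, m = a/2 can be solved by hand together with the normalization y + a + m = 1:

$$\begin{aligned}
y = \tfrac{y}{2} + \tfrac{a}{2} \;&\Rightarrow\; y = a \\
m &= \tfrac{a}{2} \\
y + a + m = a + a + \tfrac{a}{2} = \tfrac{5a}{2} = 1 \;&\Rightarrow\; a = \tfrac{2}{5},\quad y = \tfrac{2}{5},\quad m = \tfrac{1}{5}
\end{aligned}$$

The remaining equation is then satisfied automatically, since y/2 + m = 1/5 + 1/5 = 2/5 = a. The three flow equations are linearly dependent, which is why the normalization constraint is needed to pin down a unique answer.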


Matrix formulation

• Matrix M has one row and one column for each web page

• Suppose page j has n outlinks
  – If j links to i, then Mij = 1/n
  – Else Mij = 0
  – Columns sum to 1
• Suppose r is a vector with one entry per web page
  – ri is the importance score of page i
  – Call it the rank vector
• Then the flow equations can be written r = Mr
• So the rank vector is an eigenvector of the web matrix (with eigenvalue 1)


Example

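[Figure: the Yahoo / Amazon / M’soft graph]

Matrix M (column j holds the outlinks of page j):

        y     a     m
  y    1/2   1/2    0
  a    1/2    0     1
  m     0    1/2    0

Flow equations:

y = y/2 + a/2
a = y/2 + m
m = a/2

In matrix form, r = Mr:

$$\begin{pmatrix} y \\ a \\ m \end{pmatrix} =
\begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 1 \\ 0 & 1/2 & 0 \end{pmatrix}
\begin{pmatrix} y \\ a \\ m \end{pmatrix}$$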


Power Iteration method

• Simple iterative scheme
• Suppose there are N web pages
• Initialize: $r^0 = [1/N, \ldots, 1/N]^T$
• Iterate: $r^{k+1} = M r^k$
• Stop when $|r^{k+1} - r^k|_1 < \varepsilon$, for a small tolerance $\varepsilon$
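The iterative scheme above can be sketched in plain Python as follows. The tolerance `eps` and the iteration cap `max_iter` are illustrative choices, and the matrix is the three-page Yahoo / Amazon / M'soft example from these slides.

```python
def power_iteration(M, eps=1e-10, max_iter=1000):
    """Iterate r <- M r from the uniform vector until the L1 change is below eps.

    M is a column-stochastic matrix given as a list of rows:
    M[i][j] = 1/n if page j links to page i (j having n outlinks), else 0.
    """
    n = len(M)
    r = [1.0 / n] * n
    for _ in range(max_iter):
        r_next = [sum(M[i][j] * r[j] for j in range(n)) for i in range(n)]
        if sum(abs(x - y) for x, y in zip(r_next, r)) < eps:
            return r_next
        r = r_next
    return r


# Rows and columns ordered (y, a, m): Yahoo, Amazon, M'soft.
M = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 1.0],
     [0.0, 0.5, 0.0]]
r = power_iteration(M)
print([round(x, 4) for x in r])
```

For this matrix the iterates approach y = 2/5, a = 2/5, m = 1/5, the same values obtained by solving the flow equations directly.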


Power Iteration Example

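[Figure: the Yahoo / Amazon / M’soft graph]

Matrix M:

        y     a     m
  y    1/2   1/2    0
  a    1/2    0     1
  m     0    1/2    0

Iterates of r = (y, a, m):

       r0     r1     r2      r3      ...   limit
  y   1/3    1/3    5/12    3/8      ...   2/5
  a   1/3    1/2    1/3     11/24    ...   2/5
  m   1/3    1/6    1/4     1/6      ...   1/5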


Random Walk Interpretation

• Imagine a random web surfer
  – At any time t, surfer is on some page P
  – At time t+1, the surfer follows an outlink from P uniformly at random
  – Ends up on some page Q linked from P
  – Process repeats indefinitely
• Let p(t) be a vector whose ith component is the probability that the surfer is at page i at time t
  – p(t) is a probability distribution on pages


The stationary distribution

• Where is the surfer at time t+1?
  – Follows a link uniformly at random
  – p(t+1) = Mp(t)
• Suppose the random walk reaches a state such that p(t+1) = Mp(t) = p(t)
  – Then p(t) is called a stationary distribution for the random walk
• Our rank vector r satisfies r = Mr
  – So it is a stationary distribution for the random surfer
  – (also, r is an eigenvector of M)


Existence and Uniqueness

A central result from the theory of random walks (aka Markov processes):

For graphs that satisfy certain conditions, the stationary distribution is unique and eventually will be reached no matter what the initial probability distribution at time t = 0.


Spider traps

• A group of pages is a spider trap if there are no links from within the group to outside the group
  – Random surfer gets trapped
• Spider traps violate the conditions needed for the random walk theorem
• Solution for traps:
  – At every step, with probability β, follow a link at random
  – With probability 1-β, jump to some page uniformly at random (teleport)
  – Common values for β are in the range 0.8 to 0.9

• This is the essence of Google’s PageRank algorithm


Matrix formulation

• Suppose there are N pages
  – Consider a page j, with set of outlinks O(j)
  – We have Mij = 1/|O(j)| when j links to i and Mij = 0 otherwise
  – The random teleport is equivalent to
    • adding a teleport link from j to every other page with probability (1-β)/N
    • reducing the probability of following each outlink from 1/|O(j)| to β/|O(j)|
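The teleport formulation can be sketched as a small change to power iteration: each step follows links with probability `beta` and adds a uniform teleport share of (1 - beta)/N to every page. The sketch assumes every page has at least one outlink (no dead ends), so that the rank vector keeps summing to 1; the parameter name `beta`, the tolerance, and the iteration cap are illustrative choices.

```python
def pagerank(M, beta=0.8, eps=1e-10, max_iter=1000):
    """Power iteration with random teleports.

    M is column-stochastic (M[i][j] = 1/|O(j)| if j links to i, else 0).
    Assumes no dead ends, i.e. every column of M sums to 1.
    """
    n = len(M)
    r = [1.0 / n] * n
    for _ in range(max_iter):
        r_next = [
            beta * sum(M[i][j] * r[j] for j in range(n)) + (1.0 - beta) / n
            for i in range(n)
        ]
        if sum(abs(x - y) for x, y in zip(r_next, r)) < eps:
            return r_next
        r = r_next
    return r


# Spider-trap variant of the three-page example, ordered (y, a, m):
# M'soft links only to itself.
M_trap = [[0.5, 0.5, 0.0],
          [0.5, 0.0, 0.0],
          [0.0, 0.5, 1.0]]
ranks = pagerank(M_trap, beta=0.8)
print([round(x, 3) for x in ranks])
```

With beta = 0.8 the trap page still collects the largest share, but Yahoo and Amazon keep nonzero rank (the fixed point is 7/33, 5/33, 21/33), unlike plain power iteration where all the rank drains into the trap.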


Previous example with traps

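[Figure: the Yahoo / Amazon / M’soft graph, with M’soft now linking only to itself (a spider trap)]

Matrix M:

        y     a     m
  y    1/2   1/2    0
  a    1/2    0     0
  m     0    1/2    1

Iterates of r = (y, a, m):

       r0     r1     r2      r3      ...   limit
  y   1/3    1/3    1/4     5/24     ...   0
  a   1/3    1/6    1/6     1/8      ...   0
  m   1/3    1/2    7/12    2/3      ...   1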


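Previous example with β = 0.8

[Figure: the Yahoo / Amazon / M’soft graph, with M’soft linking only to itself]

Rows and columns ordered (y, a, m):

$$0.8 \begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 0 \\ 0 & 1/2 & 1 \end{pmatrix}
+ 0.2 \begin{pmatrix} 1/3 & 1/3 & 1/3 \\ 1/3 & 1/3 & 1/3 \\ 1/3 & 1/3 & 1/3 \end{pmatrix}
= \begin{pmatrix} 7/15 & 7/15 & 1/15 \\ 7/15 & 1/15 & 1/15 \\ 1/15 & 7/15 & 13/15 \end{pmatrix}$$

Iterates of r = (y, a, m):

       r0     r1      r2      r3      ...   limit
  y   1/3    0.33    0.28    0.26     ...   0.212
  a   1/3    0.20    0.20    0.18     ...   0.152
  m   1/3    0.47    0.52    0.56     ...   0.636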


The Google Model

• Google uses a combination of tools:
  – TF-IDF from query to page retrieval
  – PageRank to upweight important pages
  – Link text info
• Problems:
  – Biased against topic-specific authorities
  – Ambiguous queries, e.g., jaguar, spears
• Susceptible to link spam
  – Artificial links created in order to boost page rank
  – Called Google Bombing, e.g. “miserable failure”


Other measures of importance

• Hubs and Authorities (Kleinberg)
  – PageRank model works on the assumption that important pages link to important pages
  – Kleinberg notes that important sites might not link to each other
  – authorities
    • pages which are prominent for a given topic
  – hubs
    • assemble high-quality guides and direct users to authorities
  – A good hub page is one that points to many good authority pages; a good authority page is one that is pointed to by many good hub pages
  – each page gets a hub score and an authority score… this also helps in defining web communities


HITS: Hubs and Authorities

• The HITS algorithm has two basic steps:
  – Authority Update: Update each node's Authority score to be equal to the sum of the Hub Scores of each node that points to it.
  – Hub Update: Update each node's Hub Score to be equal to the sum of the Authority Scores of each node that it points to.
• Let a be the vector of authority scores and h be the vector of hub scores
• Here M is the adjacency matrix, with Mij = 1 if page i links to page j
• Initialize a = [1, 1, ..., 1] and h = [1, 1, ..., 1]
• Do $a = M^T h$; $h = M a$
• Normalize a and h
• Repeat until a and h converge
• The vectors a* and h* represent the authority and hub weights
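The two update steps can be sketched directly on an adjacency list, without building the matrix. This sketch runs a fixed number of rounds instead of testing for convergence and normalizes to unit Euclidean length; both are simplifying assumptions, and the small link graph is hypothetical.

```python
import math


def hits(links, n_iter=50):
    """Compute hub and authority scores.

    links: dict mapping each page to the list of pages it links to.
    Returns (authority, hub) dicts, each normalized to unit length.
    """
    pages = sorted(set(links) | {q for targets in links.values() for q in targets})
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(n_iter):
        # Authority update: sum of hub scores of pages pointing to each page.
        new_auth = {p: 0.0 for p in pages}
        for p, targets in links.items():
            for q in targets:
                new_auth[q] += hub[p]
        # Hub update: sum of (new) authority scores of pages each page points to.
        new_hub = {p: sum(new_auth[q] for q in links.get(p, [])) for p in pages}
        # Normalize so the scores do not grow without bound.
        a_norm = math.sqrt(sum(v * v for v in new_auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in new_hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in new_auth.items()}
        hub = {p: v / h_norm for p, v in new_hub.items()}
    return auth, hub


# Hypothetical graph: h1, h2, h3 act as hubs; x, y, z act as authorities.
links = {"h1": ["x", "y"], "h2": ["x", "y", "z"], "h3": ["x"]}
auth, hub = hits(links)
print(max(auth, key=auth.get), max(hub, key=hub.get))
```

In this graph page x is pointed to by all three hubs and ends up with the highest authority score, while h2, which points to every authority, ends up as the best hub.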


Web Advertising and Auction Models


History of web advertising

• Banner ads (1995-2001)
  – Initial form of web advertising
  – Popular websites charged X$ for every 1000 “impressions” of ad
    • Called “CPM” rate
    • Modeled similar to TV, magazine ads
  – Untargeted to demographically targeted
  – Low clickthrough rates
    • low ROI for advertisers


Performance-based advertising

• Introduced by Overture around 2000
  – Advertisers “bid” on search keywords
  – “second price” auction (why?)
  – When someone searches for that keyword, the highest bidder’s ad is shown
  – Advertiser is charged only if the ad is clicked on
• Google’s version came out in 2000
  – Called “AdWords”


Ads vs. search results


Web 2.0

• Performance-based advertising works!
  – Multi-billion-dollar industry
  – Ad server has an incentive to provide the best ads for a given search - they only get paid if successful!
  – Auction model is sensible… what are you willing to pay?
  – Top words:
    • http://www.cwire.org/highest-paying-search-terms/
• Interesting problems
  – What ads to show for a search?
  – If I’m an advertiser, which search terms should I bid on and how much to bid?


Adwords problem

• A stream of queries arrives at the search engine
  – q1, q2,…

• Several advertisers bid on each query

• When query qi arrives, search engine must pick a subset of advertisers whose ads are shown

• Goal: maximize search engine’s revenues


Greedy algorithm

• Simplest algorithm is greedy - highest bidder wins!

• Complications:
  – Each ad has a different likelihood of being clicked (quality)
    • Advertiser 1 bids $2, click probability = 0.1
    • Advertiser 2 bids $1, click probability = 0.5
    • Click probability based on historic data and statistical model
    • Maximize both ad revenue and user relevance
    • Google solution: use the “expected revenue per click” as a ranking
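The two-advertiser example above can be worked through by ranking on bid times click probability, which is the expected revenue each time the ad is shown. The function name `rank_ads` and the tuple layout are assumptions made for this sketch.

```python
def rank_ads(ads):
    """Sort ads by expected revenue (bid * click probability), highest first.

    ads: list of (advertiser, bid_in_dollars, click_probability) tuples.
    """
    return sorted(ads, key=lambda ad: ad[1] * ad[2], reverse=True)


ads = [
    ("Advertiser 1", 2.00, 0.1),   # expected revenue 2.00 * 0.1 = 0.20
    ("Advertiser 2", 1.00, 0.5),   # expected revenue 1.00 * 0.5 = 0.50
]
ranked = rank_ads(ads)
for name, bid, p_click in ranked:
    print(name, round(bid * p_click, 2))
```

A purely greedy rule would show Advertiser 1 (the higher bid), but Advertiser 2 is expected to earn more per showing because the ad is clicked five times as often.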


Adwords model

• Max Cost per Click (CPC) determined by advertiser
• Quality Score (of ad) is determined by Google
• Minimum bid is determined by popularity of search term and by Quality Score of ad
• Rank is calculated, and determines position
• Actual CPC is the answer to this question: “What is the lowest amount I can pay, and still retain my rank?”
• Actual cost is one cent more than needed to retain rank
  – e.g. if A had bid $0.36, she would have been tied with B for Rank, so charge A $0.37 (as long as this is above the Min Bid)
• Google also takes into account advertiser budget, and diversity of rankings
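The "one cent more than needed to retain rank" rule can be illustrated with a small calculation. The slides do not give the rank formula, so this sketch assumes Rank = max CPC × Quality Score, a commonly described form of the model; the function name, the integer quality scores, and the numbers for advertisers A and B are all hypothetical, chosen so that the tie point lands at $0.36 as in the example above.

```python
def actual_cpc_cents(own_quality, rank_below, min_bid_cents):
    """Lowest whole-cent price that still beats the next ad's rank.

    Assumes rank = bid_in_cents * quality_score (an assumption, see above).
    A bid of rank_below / own_quality would tie (or fall just short), so the
    charge is one cent above the floor of that value, never below the min bid.
    """
    tie_bid = rank_below // own_quality
    return max(tie_bid + 1, min_bid_cents)


# Hypothetical advertisers: A bids 50 cents with quality 10 (rank 500),
# B bids 40 cents with quality 9 (rank 360).
rank_b = 40 * 9
charge_a = actual_cpc_cents(own_quality=10, rank_below=rank_b, min_bid_cents=5)
print(charge_a)   # price in cents that A pays per click
```

With these numbers A would tie B at a bid of 36 cents, so A is charged 37 cents per click rather than her full 50-cent maximum.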


References

• A nice overview of web mining topics, models and issues:
  – http://users.atw.hu/ignatius/mining.pdf
• “Discovery of Web Robot Sessions…”
  – Tan and Kumar (2002)
• Book: Programming Collective Intelligence
• Original PageRank paper, published by Brin and Page when they were in graduate school
• Markov Chains for Link Prediction: paper
• Kleinberg’s HITS Algorithm: paper
