Web Data Mining
Web Mining
Web Mining is the use of data mining techniques to automatically discover and extract information from web documents and services
Discovering useful information from the World-Wide Web and its usage patterns
Using data mining techniques to make the web more useful and more profitable (for some) and to increase the efficiency of our interaction with the web
Web Mining
Data Mining Techniques
– Association rules
– Sequential patterns
– Classification
– Clustering
– Outlier discovery
Applications to the Web
– E-commerce
– Information retrieval (search)
– Network management
Examples of Discovered Patterns
Association rules: 98% of AOL users also have E-trade accounts
Classification: people with age less than 40 and salary > 40k trade on-line
Clustering: users A and B access similar URLs
Outlier detection: user A spends more than twice the average amount of time surfing on the Web
Web Mining
The WWW is a huge, widely distributed, global information service centre for
– Information services: news, advertisements, consumer information, financial management, education, government, e-commerce, etc.
– Hyper-link information
– Access and usage information
The WWW provides rich sources of data for data mining
Why Mine the Web?
Enormous wealth of information on the Web
– Financial information (e.g. stock quotes)
– Book/CD/video stores (e.g. Amazon)
– Restaurant information (e.g. Zagats)
– Car prices (e.g. Carpoint)
Lots of data on user access patterns
– Web logs contain the sequences of URLs accessed by users
Possible to mine interesting nuggets of information
– People who ski also travel frequently to Europe
– Tech stocks have corrections in the summer and rally from November until February
Why is Web Mining Different?
The Web is not just a huge collection of documents; it also carries
– Hyper-link information
– Access and usage information
The Web is very dynamic
– New pages are constantly being generated
Challenge: develop new Web mining algorithms and adapt traditional data mining algorithms to
– Exploit hyper-links and access patterns
– Be incremental
Web Mining Applications
E-commerce (infrastructure)
– Generate user profiles
– Targeted advertising
– Fraud
– Similar image retrieval
Information retrieval (search) on the Web
– Automated generation of topic hierarchies
– Web knowledge bases
– Extraction of schema for XML documents
Network management
– Performance management
– Fault management
User Profiling
Important for improving customization
– Provide users with pages and advertisements of interest
– Example profiles: on-line trader, on-line shopper
Generate user profiles based on their access patterns (a sketch follows this list)
– Cluster users based on frequently accessed URLs
– Use a classifier to generate a profile for each cluster
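As an illustration of the cluster-then-classify idea, here is a minimal Python sketch; the URL access counts, the choice of k = 2, and the profile names are invented for the example, and the slides do not prescribe a particular clustering algorithm.

import numpy as np
from sklearn.cluster import KMeans

# Rows = users, columns = URLs; entries = access counts taken from web logs.
access_counts = np.array([
    [9, 8, 0, 0],   # mostly trading URLs
    [7, 9, 1, 0],
    [0, 1, 8, 9],   # mostly shopping URLs
    [1, 0, 9, 7],
])

# Step 1: cluster users on their frequently accessed URLs.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(access_counts)
print(labels)  # e.g. [0 0 1 1]: cluster 0 = "on-line trader", cluster 1 = "on-line shopper"

# Step 2 (per the slide): train a classifier on each cluster's members
# to produce a reusable profile for assigning new users.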
Engage Technologies
– Tracks web traffic to create anonymous user profiles of Web surfers
– Has profiles for more than 35 million anonymous users
Internet Advertising
Ads are a major source of revenue for Web portals (e.g., Yahoo, Lycos) and E-commerce sites
Plenty of startups doing internet advertising
– DoubleClick, AdForce, Flycast, AdKnowledge
Internet advertising is probably the "hottest" web mining application today
Scheme 1:
– Manually associate a set of ads with each user profile
– For each user, display an ad from the set based on profile
Scheme 2 (sketched below):
– Automate the association between ads and users
– Use ad click information to cluster users (each user is associated with the set of ads that he/she clicked on)
– For each cluster, find the ads that occur most frequently in the cluster; these become the ads for the set of users in the cluster
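A rough Python sketch of Scheme 2's final step; the click data and cluster assignments are invented, and a real system would first produce the clusters from the click matrix.

from collections import Counter

# clicks[user] = set of ads that user clicked on (invented example data)
clicks = {
    "U1": {"A1", "A2"}, "U2": {"A1", "A3"},   # assumed members of cluster 0
    "U3": {"A4", "A5"}, "U4": {"A4"},         # assumed members of cluster 1
}
clusters = {0: ["U1", "U2"], 1: ["U3", "U4"]}

# For each cluster, the most frequently clicked ads become the ads
# displayed to every user in that cluster.
for cid, users in clusters.items():
    freq = Counter(ad for u in users for ad in clicks[u])
    print(cid, [ad for ad, _ in freq.most_common(2)])  # e.g. 0 ['A1', 'A2']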
Internet Advertising
Use collaborative filtering (e.g. Likeminds, Firefly)
– Each user Ui has a rating for a subset of ads (based on click information, time spent, items bought, etc.)
– Rij: the rating of user Ui for ad Aj
Problem: compute user Ui's rating for an unrated ad Aj
[Figure: a user-by-ad ratings matrix with columns A1 A2 A3 and a '?' marking the unrated entry]

Internet Advertising
Key idea: user Ui's rating for ad Aj is set to Rkj, where Uk is the user whose rating of ads is most similar to Ui's
User Ui's rating for an ad Aj that has not been previously displayed to Ui is computed as follows (see the sketch below):
– Consider a user Uk who has rated ad Aj
– Compute Dik, the distance between Ui's and Uk's ratings on common ads
– Ui's rating for ad Aj = Rkj (Uk is the user with the smallest Dik)
– Display to Ui the ad Aj with the highest computed rating
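A minimal sketch of this nearest-neighbour computation in Python; the users, ads, and rating values are invented, and a real system would derive Rij from clicks, time spent, and purchases as the slides describe.

# ratings[user][ad] = Rij, the rating of user Ui for ad Aj
ratings = {
    "U1": {"A1": 5, "A2": 1},
    "U2": {"A1": 4, "A2": 2, "A3": 5},
    "U3": {"A1": 1, "A3": 2},
}

def distance(u, k):
    """Average L1 distance between two users' ratings on their common ads (Dik)."""
    common = ratings[u].keys() & ratings[k].keys()
    if not common:
        return float("inf")
    return sum(abs(ratings[u][ad] - ratings[k][ad]) for ad in common) / len(common)

def predict(u, ad):
    """Ui's rating for unrated ad Aj = Rkj, where Uk has rated Aj and minimizes Dik."""
    candidates = [k for k in ratings if k != u and ad in ratings[k]]
    if not candidates:
        return None
    nearest = min(candidates, key=lambda k: distance(u, k))
    return ratings[nearest][ad]

print(predict("U1", "A3"))  # 5: U2 is closest to U1 and rated A3 with 5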
Fraud
With the growing popularity of e-commerce, systems to detect and prevent fraud on the Web become important
Maintain a signature for each user based on buying patterns on the Web (e.g., amount spent, categories of items bought)
If the buying pattern changes significantly, then signal fraud (a toy sketch follows)
HNC Software uses domain knowledge and neural networks for credit card fraud detection
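A toy version of the signature idea in Python; the fields, the z-score test, and the threshold are assumptions for illustration, not details given in the slides.

# A user's signature: running statistics of their buying pattern.
signature = {"mean_spend": 40.0, "std_spend": 10.0,
             "categories": {"books", "music"}}

def looks_fraudulent(purchase, sig, z_threshold=3.0):
    """Signal fraud when a purchase deviates sharply from the stored signature."""
    z = abs(purchase["amount"] - sig["mean_spend"]) / sig["std_spend"]
    unfamiliar = purchase["category"] not in sig["categories"]
    return z > z_threshold or unfamiliar

print(looks_fraudulent({"amount": 500.0, "category": "jewelry"}, signature))  # True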
Retrieval of Similar Images
Given:
– A set of images
Find:
– All images similar to a given image
– All pairs of similar images
Sample applications:
– Medical diagnosis
– Weather prediction
– Web search engines for images
– E-commerce
QBIC, Virage, and Photobook compute a feature signature for each image
– QBIC uses color histograms
– WBIIS, WALRUS use wavelets
Use a spatial index to retrieve the database image whose signature is closest to the query's signature
WALRUS decomposes an image into regions
– A single signature is stored for each region
– Two images are considered similar if they have enough similar region pairs
[Figure: a query image alongside the images retrieved by WALRUS]
Problems with Web Search Today
Today's search engines are plagued by problems:
– the abundance problem (99% of info of no interest to 99% of people)
– limited coverage of the Web (internet sources hidden behind search interfaces); the largest crawlers cover < 18% of all web pages
– limited query interfaces, based on keyword-oriented search
– limited customization to individual users
– the Web is highly dynamic: lots of pages are added, removed, and updated every day
– very high dimensionality
Improve Search By Adding Structure to the Web
Use Web directories (or topic hierarchies)
– Provide a hierarchical classification of documents (e.g., Yahoo!)
Searches performed in the context of a topic restrict the search to only the subset of web pages related to that topic
[Figure: the Yahoo home page hierarchy: Recreation (Travel, Sports), Science, Business (Companies, Finance, Jobs), News]
Automatic Creation of Web Directories
In the Clever project, hyper-links between Web pages are taken into account when categorizing them
– Use a Bayesian classifier
– Exploit knowledge of the classes of the immediate neighbors of the document to be classified
– Show that simply taking text from the neighbors and using standard document classifiers to classify the page does not work
Inktomi's Directory Engine uses "Concept Induction" to automatically categorize millions of documents
Network Management
Objective: to deliver content to users quickly and reliably
– Traffic management
– Fault management
[Figure: a service provider network of routers and servers]
Why is Traffic Management Important?
While annual bandwidth demand is increasing ten-fold on average, annual bandwidth supply is rising only by a factor of three
The result is frequent congestion at servers and on network links
– During a major event (e.g., Princess Diana's death), an overwhelming number of user requests can result in millions of redundant copies of data flowing back and forth across the world
– Olympic sites during the games
– NASA sites close to launches and landings of shuttles
Traffic Management
Key ideas
– Dynamically replicate/cache content at multiple sites within the network, closer to the user
– Maintain multiple paths between any pair of sites
– Route user requests to the server closest to the user, or to the least loaded server
– Use the path with the least congested network links
Akamai and Inktomi offer such content delivery services
Traffic Management
[Figure: a request routed around a congested server and a congested link in the service provider network]
Traffic Management
Need to mine network and Web traffic to determine
– What content to replicate?
– Which servers should store replicas?
– Which path to use to route packets?
– Which server to route a user request to?
Network design issues
– Where to place servers?
– Where to place routers?
– Which routers should be connected by links?
One can use association rule and sequential pattern mining algorithms to cache/prefetch replicas at servers
Fault Management
Fault management involves
– Quickly identifying failed/congested servers and links in the network
– Re-routing user requests and packets to avoid congested/down servers and links
Need to analyze alarm and traffic data to carry out root cause analysis of faults
Bayesian classifiers can be used to predict the root cause given a set of alarms
Web Mining Issues
Size
– Grows at about 1 million pages a day
– Google indexes 9 billion documents
– Number of web sites: the Netcraft survey says 72 million sites (http://news.netcraft.com/archives/web_server_survey.html)
Diverse types of data
– Images
– Text
– Audio/video
– XML
– HTML
Number of Active Sites
[Figure: Netcraft survey, total sites across all domains, August 1995 to October 2007]
Systems Issues
– Web data sets can be very large: tens to hundreds of terabytes
– Cannot mine on a single server! Need large farms of servers
– How to organize hardware/software to mine multi-terabyte data sets, without breaking the bank!
Different Data Formats
Structured data
Unstructured data
OLE DB offers some solutions!
Web Data
Web pages
Intra-page structures
Inter-page structures
Usage data logs
Supplemental data
– Profiles
– Registration information
– Cookies
Web Mining
Web Mining
– is the application of data mining techniques to discover patterns from the Web
Data Mining
– also called Knowledge Discovery in Databases (KDD), is the process of automatically searching large volumes of data for patterns (extracting knowledge from data)
Web mining comprises three areas:
– Web Content Mining (WCM)
– Web Structure Mining (WSM)
– Web Usage Mining (WUM)
Web Mining
Web Structure Mining
– uses graph theory to analyse the node and connection structure of a web site
– retrieves pages that are not only relevant, but also of high quality, or authoritative on the topic
Web Content Mining
– discovers useful information from the content of a web page; the web content may consist of text, image, audio or video data
Web Usage Mining
– analyses and discovers interesting patterns in users' usage data on the web; the usage data records the users' behaviour as they browse or make transactions on the web site
Web Structure Mining
Mine the structure (links, graph) of the Web
Techniques
– PageRank
– CLEVER
Create a model of the Web's organization
May be combined with content mining to more effectively retrieve important pages
PageRank
Used by Google
Prioritizes pages returned from search by looking at Web structure
The importance of a page is calculated based on the number of pages which point to it (its backlinks)
Weighting is used to give more importance to backlinks coming from important pages
Ranking web pages
Web pages are not equally "important"
– www.joe-schmoe.com vs. www.stanford.edu
Inlinks as votes
– www.stanford.edu has 23,400 inlinks
– www.joe-schmoe.com has 1 inlink
Are all inlinks equal?
Simple “flow” model
The web in 1839
[Figure: three pages Yahoo (y), Amazon (a), and M'soft (m); Yahoo links to itself and Amazon, Amazon links to Yahoo and M'soft, M'soft links to Amazon; each page's rank flows out in equal shares y/2, y/2, a/2, a/2, m]

y = y/2 + a/2
a = y/2 + m
m = a/2
Matrix formulation
Matrix M has one row and one column for each web page
Suppose page j has n outlinks
– If j → i, then Mij = 1/n
– Else Mij = 0
M is a column stochastic matrix
– Columns sum to 1
Suppose r is a vector with one entry per web page
– ri is the importance score of page i
– Call it the rank vector
– |r| = 1
(A small builder sketch follows.)
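The definition translates directly into code. A minimal Python sketch, using the three-page Yahoo/Amazon/M'soft link structure from the surrounding slides:

import numpy as np

pages = ["y", "a", "m"]
links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}  # page j -> its outlinks

idx = {p: i for i, p in enumerate(pages)}
M = np.zeros((len(pages), len(pages)))
for j, outs in links.items():
    for i in outs:
        M[idx[i], idx[j]] = 1.0 / len(outs)  # Mij = 1/n if j -> i, else 0

print(M)              # [[0.5 0.5 0. ] [0.5 0.  1. ] [0.  0.5 0. ]]
print(M.sum(axis=0))  # [1. 1. 1.]: column stochastic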
Example of Constructing M
Suppose page j links to 3 pages, including i. Then column j of M has the entry 1/3 in row i (and in the rows of the other two pages), so in M r = r, page i receives rj/3 from page j.
[Figure: the matrix equation M r = r with column j and row i highlighted, showing the entry 1/3]
Eigenvector formulation
The flow equations can be written r = M r
So the rank vector r is an eigenvector of the stochastic web matrix M
– In fact, it is the first or principal eigenvector, with corresponding eigenvalue 1
Example
[Figure: the three-page graph of Yahoo, Amazon, and M'soft]

     y    a    m
y   1/2  1/2   0
a   1/2   0    1
m    0   1/2   0

y = y/2 + a/2
a = y/2 + m
m = a/2

r = M r:

y     1/2 1/2  0    y
a  =  1/2  0   1    a
m      0  1/2  0    m
Power Iteration method
Simple iterative scheme (aka relaxation)
Suppose there are N web pages
Initialize: r0 = [1/N, …, 1/N]T
Iterate: rk+1 = M rk
Stop when |rk+1 - rk|1 < ε
– |x|1 = Σ1≤i≤N |xi| is the L1 norm
– Can use any other vector norm, e.g., Euclidean
(Code for this loop follows.)
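A direct transcription of the method in Python, using the matrix M built above; the tolerance 1e-9 is an arbitrary choice.

import numpy as np

M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 1.0],
              [0.0, 0.5, 0.0]])
N = M.shape[0]

r = np.full(N, 1.0 / N)      # r0 = [1/N, ..., 1/N]T
while True:
    r_next = M @ r           # rk+1 = M rk
    if np.abs(r_next - r).sum() < 1e-9:  # L1 norm of the change
        break
    r = r_next
print(r)  # [0.4 0.4 0.2], i.e. (2/5, 2/5, 1/5)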
Power Iteration Example
[Figure: the three-page graph of Yahoo, Amazon, and M'soft]

     y    a    m
y   1/2  1/2   0
a   1/2   0    1
m    0   1/2   0

y     1/3     1/3     5/12     3/8             2/5
a  =  1/3  →  1/2  →  1/3   →  11/24  →  …  →  2/5
m     1/3     1/6     1/4      1/6             1/5
Problems with the PageRank Algorithm

[Figure: the three-page graph, but M'soft now links only to itself]

     y    a    m
y   1/2  1/2   0
a   1/2   0    0
m    0   1/2   1

y     1     1       3/4     5/8           0
a  =  1  →  1/2  →  1/2  →  3/8  →  …  →  0
m     1     3/2     7/4      2            3

Microsoft becomes a spider trap
PageRank Calculation with Damping Factor
The PageRank of a page A is given as follows:

PR(A) = (1 - d) + d · (PR(T1)/C(T1) + … + PR(Tn)/C(Tn))

– d is the damping factor (usually set to 0.85)
– T1, …, Tn: the pages that link to page A
– PR(Ti): the PageRank of page Ti
– C(Ti): the number of outgoing links of page Ti
Example of Calculation (1/5)
[Figure: four pages A, B, C, D with links A→B, A→C, B→C, C→A, and D→C]
Given d = 0.85
Example of Calculation (2/5)
All pages start with rank 1: r0 = (1, 1, 1, 1)T for (A, B, C, D)
[Figure: Page A passes 0.85·(1/2) to each of Pages B and C; Pages B, C, and D each pass 0.85·(1/1) along their single outlink]
Example of Calculation (3/5)
Page A: 0.15 (not transferred) + 0.85 (from Page C) = 1
Page B: 0.15 (not transferred) + 0.85·(1/2) (from Page A) = 0.575
Page C: 0.85 (from Page D) + 0.85 (from Page B) + 0.425 (from Page A) + 0.15 (not transferred) = 2.275
Page D: receives none, but keeps the non-transferred 0.15 = 0.15

r1 = (1, 0.575, 2.275, 0.15)T
Example of Calculation (4/5)
Page A: 2.275·0.85 (from Page C) + 0.15 (not transferred) = 2.08375
Page B: 1·0.85/2 (from Page A) + 0.15 (not transferred) = 0.575
Page C: 0.15·0.85 (from Page D) + 0.575·0.85 (from Page B) + 1·0.85/2 (from Page A) + 0.15 (not transferred) = 1.19125
Page D: receives none, but keeps the non-transferred 0.15 = 0.15

r2 = (2.08375, 0.575, 1.19125, 0.15)T
Example of Calculation (5/5)
After 20 iterations, we get
Page A: 1.490, Page B: 0.783, Page C: 1.577, Page D: 0.15

r20 ≈ (1.490, 0.783, 1.577, 0.15)T (a script verifying this follows)
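The hand calculation can be checked with a short Python script; the link structure A→B, A→C, B→C, C→A, D→C is read off from the example figure.

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
pages = ["A", "B", "C", "D"]
d = 0.85

pr = {p: 1.0 for p in pages}           # r0 = (1, 1, 1, 1)
for _ in range(20):
    new = {p: 1.0 - d for p in pages}  # the 0.15 that is not transferred
    for j, outs in links.items():
        for i in outs:
            new[i] += d * pr[j] / len(outs)  # d * PR(Tj) / C(Tj)
    pr = new
print(pr)  # {'A': 1.490..., 'B': 0.783..., 'C': 1.577..., 'D': 0.15}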
PageRank Calculation with Random Teleports

The Google solution for spider traps
At each time step, the random surfer has two options:
– With probability β, follow a link at random
– With probability 1 - β, jump to some page uniformly at random
The surfer will teleport out of a spider trap within a few time steps
β is commonly in the range 0.8 to 0.9
PageRank with Random Teleports

Construct the N×N matrix A as follows
– Aij = β Mij + (1 - β)/N
Verify that A is a stochastic matrix
The PageRank vector r is the principal eigenvector of this matrix
– satisfying r = A r
Equivalently, r is the stationary distribution of the random walk with teleports (a quick numerical check follows)
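A quick numerical check of this construction on the spider-trap example, with β = 0.8 as in the next slides:

import numpy as np

beta = 0.8
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.0],
              [0.0, 0.5, 1.0]])   # m is the spider trap
N = M.shape[0]

A = beta * M + (1 - beta) / N     # Aij = beta*Mij + (1-beta)/N
print(A)                          # the 7/15, 1/15, 13/15 entries of the next slide
print(A.sum(axis=0))              # [1. 1. 1.]: A is column stochastic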
Random Teleports (β = 0.8)
[Figure: the three-page graph with solid link edges of weight 0.8·(1/2) and dashed teleport edges of weight 0.2·(1/3)]
Random Teleports (β = 0.8)
[Figure: the same graph, showing link edges weighted 0.8·Mij and teleport edges weighted 0.2·(1/3)]

        1/2 1/2  0         1/3 1/3 1/3       7/15 7/15  1/15
A = 0.8 1/2  0   0  + 0.2  1/3 1/3 1/3   =   7/15 1/15  1/15
         0  1/2  1         1/3 1/3 1/3       1/15 7/15 13/15
Random Teleports (β = 0.8)

    7/15 7/15  1/15
A = 7/15 1/15  1/15
    1/15 7/15 13/15

rk+1 = A rk

y     1     1.00     0.84     0.776             7/11
a  =  1  →  0.60  →  0.60  →  0.536  →  …  →    5/11
m     1     1.40     1.56     1.688            21/11
PageRank for a Large Network

The key step is matrix-vector multiplication
– rnew = A rold
Easy if we have enough main memory to hold A, rold, rnew
However, say N = 1 billion pages
– We need 4 bytes for each entry (say)
– 2 billion entries for the two vectors, approx 8 GB
– Matrix A has N^2 entries; 10^18 is a large number!
Solutions
– Block-Stripe Update algorithm
(A sparse-matrix sketch of the underlying idea follows.)
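One way to see the saving (a sketch of the sparseness idea, not the Block-Stripe algorithm itself): keep only the sparse link matrix M and fold the teleport term in on the fly, so the dense N×N matrix A is never materialized. The 3-page example stands in for a web-scale M that would have only a handful of nonzeros per column.

import numpy as np
from scipy.sparse import csr_matrix

beta = 0.8
M = csr_matrix(np.array([[0.5, 0.5, 0.0],    # sparse column-stochastic link matrix
                         [0.5, 0.0, 0.0],
                         [0.0, 0.5, 1.0]]))
N = M.shape[0]

r = np.full(N, 1.0)
for _ in range(60):
    # rnew = A rold = beta*(M rold) + (1-beta)/N * sum(rold),
    # computed without ever forming the dense matrix A.
    r = beta * (M @ r) + (1 - beta) / N * r.sum()
print(r)  # ~[0.636 0.455 1.909], i.e. (7/11, 5/11, 21/11)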
CLEVER (HITS Model)
Interesting documents fall into two classes
Authoritative pages:
– Highly important pages
– The best sources for the requested information
– Ex. course home pages
Hub pages:
– Contain links to highly important pages
– Ex. course bulletins
Proposed by IBM
Mutually recursive definition
A good hub links to many good authorities
A good authority is linked from many good hubs
Model using two scores for each node
– Hub score and authority score
– Represented as vectors h and a
[Figure: idealized view of hubs pointing to authorities]
HITS
Hyperlink-Induced Topic Search
Based on a set of keywords, find the set of relevant pages, R
Identify hub and authority pages for these
– Expand R to a base set, B, of pages linked to or from R
– Calculate weights for authorities and hubs
Pages with the highest ranks in R are returned
Transition Matrix A
HITS uses a matrix A
– A[i, j] = 1 if page i links to page j, 0 if not
AT, the transpose of A
– is similar to the PageRank matrix M, but AT has 1's where M has fractions
Hub and Authority Equations
The hub score of page P is proportional to the sum of the authority scores of the pages it links to
– h = λ A a
– The constant λ is a scale factor
The authority score of page P is proportional to the sum of the hub scores of the pages it is linked from
– a = μ AT h
– The constant μ is a scale factor
Iterative algorithm
1. Initialize h, a to all 1's
2. h = A a
3. Scale h so that its max entry is 1.0
4. a = AT h
5. Scale a so that its max entry is 1.0
6. Continue until h, a converge
(An executable version of these steps follows.)
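The six steps in Python, run on the example matrix from the next slide; the fixed iteration count stands in for a convergence test.

import numpy as np

# A[i, j] = 1 if page i links to page j (order: yahoo, amazon, m'soft)
A = np.array([[1, 1, 1],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)

h = np.ones(3)     # step 1: initialize h, a to all 1's
a = np.ones(3)
for _ in range(100):
    h = A @ a      # step 2: h = A a
    h /= h.max()   # step 3: scale so max entry is 1.0
    a = A.T @ h    # step 4: a = AT h
    a /= a.max()   # step 5: scale so max entry is 1.0
print(h)  # [1.    0.732 0.268]
print(a)  # [1.    0.732 1.   ]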
Example
[Figure: the three-page graph of Yahoo, Amazon, and M'soft]

        y a m
    y   1 1 1
A = a   1 0 1
    m   0 1 0
Example
    1 1 1        1 1 0
A = 1 0 1   AT = 1 0 1
    0 1 0        1 1 0

h(yahoo)  = 1    1       1       1             1.000
h(amazon) = 1 →  2/3  →  0.71 →  0.73 →  …  →  0.732
h(m'soft) = 1    1/3     0.29    0.27          0.268

a(yahoo)  = 1    1       1             1.000
a(amazon) = 1 →  4/5  →  0.75 →  …  →  0.732
a(m'soft) = 1    1       1             1.000

After each step, scale so that the max entry is 1.0
PageRank vs. Authorities

PageRank (Google)
– computed for all web pages stored in the database prior to the query
– computes authorities only
– trivial and fast to compute
HITS (CLEVER, IBM)
– performed on the set of retrieved web pages for each query
– computes authorities and hubs
– easy to compute, but real-time execution is hard
PageRank Algorithm (Damping)

initialize ranks R0
while (not converged)
    for each vertex i
        Rk+1(i) = (1 - d) + d · Σj∈Bi Rk(j) / Nj
    end
end

(Bi: the set of pages linking to i; Nj: the number of outgoing links of page j)
HITS Algorithm

initialize authority and hub weights, x0 and y0
while (not converged)
    for each vertex i
        yk+1(i) = Σj∈F(i) xk(j)    (hub weight of i: sum of authority weights of the pages i links to)
        xk+1(i) = Σj∈B(i) yk(j)    (authority weight of i: sum of hub weights of the pages linking to i)
    end
end