Web Usage Mining Chris Yang
Web Usage Mining
Chris Yang
2
Three Phases of Web Usage Mining
• Discover usage patterns from Web data to understand and better serve the needs of Web-based applications (Srivastava et al., 2000)
• Three phases
  – Preprocessing
  – Pattern discovery
  – Pattern analysis
3
4
Motivation of Web Usage Mining
• Bring vendors and end customers in electronic commerce closer together
• Mass customization
  – A vendor may personalize its product message for individual customers at a massive scale
5
Data Sources
Server
• The Web server log explicitly records the browsing behavior of site visitors and reflects access to a Web site by multiple users
• Formats
  – Common log format
  – Extended log format
• The Web log may not be completely reliable
  – Caching: files stored at the client are not requested from the server again
  – Information passed through the POST method is not available in a server log
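As a rough illustration of what a server log entry carries, the sketch below parses one line in the common log format. The sample line, the regular expression, and the function name `parse_clf` are assumptions made for this example; real logs often need more tolerant parsing.

```python
import re
from datetime import datetime

# Common log format: host ident user [time] "request" status size
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+)'
)

def parse_clf(line):
    """Return a dict of fields for one common-log-format line, or None."""
    m = CLF_PATTERN.match(line)
    if m is None:
        return None
    rec = m.groupdict()
    rec["time"] = datetime.strptime(rec["time"], "%d/%b/%Y:%H:%M:%S %z")
    rec["status"] = int(rec["status"])
    rec["size"] = 0 if rec["size"] == "-" else int(rec["size"])
    parts = rec["request"].split()
    # The request line is "METHOD URL VERSION"; keep method and URL if present.
    rec["method"] = parts[0] if len(parts) >= 2 else None
    rec["url"] = parts[1] if len(parts) >= 2 else None
    return rec

# Illustrative log line (made up for this sketch).
sample = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
          '"GET /apache_pb.gif HTTP/1.0" 200 2326')
record = parse_clf(sample)
```

Note that nothing in such a record reveals data sent in a POST body, which is the limitation listed above.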
6
HTTP
• The Web's RPC on top of TCP/IP
• It is stateless: the server keeps no memory of earlier requests, and in HTTP/1.0 a separate connection is made for every request
  – Simple to implement, yet incurs overhead
• Each HTTP client/server interaction consists of a single request/reply interchange
  – HTTP request
  – HTTP response
7
An HTTP request message consists of:
1. Request line
   a) Method or command to apply to a server resource, e.g. GET, POST
   b) URL (without protocol and server domain name)
   c) Protocol version used by the client, e.g. HTTP/1.0
2. Request header fields
   – Pass additional information about the request and the client itself to the server, much like RPC parameters
   – Each header field consists of a name, followed by ":" and the field value
3. Entity body (optional)
   – Clients use it to pass bulk information to the server (CGI)
• Examples of HTTP methods
  • GET - retrieve the specified URL
  • POST - send this data to the specified URL
• Examples of HTTP header fields
  • Accept - lists acceptable MIME type/subtype contents
  • User-Agent - provides client browser information
Note: crlf = carriage-return/line-feed
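The three parts of a request message can be made concrete with a small sketch that splits a raw request into its request line, header fields, and entity body. The request text, the URL `/cgi-bin/order`, and the helper `parse_http_request` are hypothetical and only illustrate the layout described above.

```python
def parse_http_request(raw):
    """Split a raw HTTP request into request line, header fields, and body."""
    # A blank line (crlf crlf) separates the header section from the body.
    head, _, body = raw.partition("\r\n\r\n")
    lines = head.split("\r\n")
    # Request line: method, URL, protocol version.
    method, url, version = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        # Each header field is "name: value".
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return {"method": method, "url": url, "version": version,
            "headers": headers, "body": body}

# Hypothetical request, written out with explicit crlf line endings.
raw_request = (
    "POST /cgi-bin/order HTTP/1.0\r\n"
    "Accept: text/html\r\n"
    "User-Agent: ExampleBrowser/1.0\r\n"
    "Content-Length: 7\r\n"
    "\r\n"
    "item=42"
)
request = parse_http_request(raw_request)
```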
8
An HTTP response message consists of:
1. Response header line
   – HTTP version, the status of the response, and an explanation of the returned status
2. Response header fields
   – Information that describes the server's attributes and the returned HTML document to the client
3. Entity body
   – Contains the HTML document that the client has requested
• Each HTML document needs a separate request message (stateless)
• The result code 200 indicates that the request was successful.
9
Data Source - Server
Web server log in extended log format
10
Data Source - Server
• Packet sniffing
  – Monitor network traffic coming to a Web server
  – Extract usage data directly from TCP/IP packets
• Cookies
  – Tokens generated by the Web server for individual client browsers to automatically track the site visitor
  – The HTTP protocol is stateless, which makes tracking individual users difficult
  – Cookies rely on implicit user cooperation
• Query data
  – CGI scripts
  – The URI for a CGI program may contain additional parameter values to be passed to the CGI application
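Query data can be recovered from a logged URI with the standard library. The sketch below uses `urllib.parse`; the URI and its parameter names are invented for illustration.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical URI of a CGI program with parameters in the query string.
uri = "/cgi-bin/search?category=music&page=2"

parts = urlsplit(uri)
script_path = parts.path          # the CGI program being called
params = parse_qs(parts.query)    # each parameter maps to a list of values
```

Parameters sent with GET appear in the logged URI like this; parameters sent with POST travel in the entity body and are not visible here.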
11
Data Source - Client
Client
• Remote agent (e.g. JavaScript or Java applets)
• Modifying the source code of an existing browser to enhance data collection capabilities
• Difficulty: requires client cooperation, either to enable JavaScript and Java applets or to voluntarily use the modified browser
12
Data Source - Proxy
Proxy
• Caching between client browsers and Web servers
• Proxy traces may reveal the actual HTTP requests from multiple clients to multiple Web servers
• This helps to characterize the browsing behavior of a group of anonymous users sharing a common proxy server
13
Data Abstractions
• Data from server, client, and proxy help us construct data abstractions
  – Users, server sessions, episodes, click-streams, and page views
• The W3C Web Characterization Activity (WCA) has drafted Web term definitions relevant to Web usage (http://www.w3.org/WCA)
• User – a single individual who is accessing files from one or more Web servers through a browser
  – Users are difficult to identify: a user may access the Web through different machines or use more than one agent on a single machine
• Page view – every file that contributes to the display on a user's browser at one time
  – Includes several files such as frames, graphics, and scripts
  – When users download a "Web page" by clicking an anchor text or submitting a URL, they are not aware of how many frames, graphics, images, or scripts they are receiving
• Click-stream – a sequential series of page view requests
  – The server may not have all the information needed to obtain the click-stream
  – Page views served from a client- or proxy-level cache are not available at the server
• User session – the click-stream of page views for a single user across the entire Web
  – In practice, only the portion of a user session that accesses a particular site can be identified
• Server session – the set of page views in a user session for a particular Web site
• Episode – any semantically meaningful subset of a user or server session
14
Phase 1 –Preprocessing
Usage Preprocessing
• Due to the incompleteness of the available data, usage preprocessing is a difficult task
• Typical problems
  – Unless client-side tracking is used, only the IP address, agent, and server-side click-stream are available
  – Single IP address / multiple server sessions
    Internet service providers (ISPs) have a pool of proxy servers; a proxy server may have several users accessing a Web site, potentially over the same time period
  – Multiple IP addresses / single server session
    Some ISPs or privacy tools randomly assign each request from a user to one of several IP addresses
  – Multiple IP addresses / single user
    A user accesses the Web from different machines (a different IP address from session to session)
  – Multiple agents / single user
    A user who uses more than one browser appears as multiple users
15
Usage Preprocessing
• Segmenting the click-stream into sessions
  – It is difficult to know when a user leaves a Web site
  – A thirty-minute timeout is often used (Catledge and Pitkow, 1995)
  – In some cases a session ID is embedded in each URI, and the session is defined by the content server
• Content from user action
  – Content servers maintain state variables for each active session, so the information needed to determine the content served for a user request is not always available
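The thirty-minute timeout heuristic can be sketched as follows. This is a minimal example that assumes a visitor is identified by the pair (IP address, agent), which the preceding slide shows to be an imperfect proxy for a user; the record layout and the name `sessionize` are choices made for this illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta

TIMEOUT = timedelta(minutes=30)

def sessionize(records, timeout=TIMEOUT):
    """Split (ip, agent, timestamp, url) records into sessions.

    Assumption: (ip, agent) stands in for a single visitor. A gap longer
    than `timeout` between two requests starts a new session.
    """
    by_visitor = defaultdict(list)
    for ip, agent, ts, url in records:
        by_visitor[(ip, agent)].append((ts, url))

    sessions = []
    for visitor, hits in by_visitor.items():
        hits.sort()                      # order each visitor's requests by time
        current = [hits[0][1]]
        last_ts = hits[0][0]
        for ts, url in hits[1:]:
            if ts - last_ts > timeout:   # visitor assumed to have left the site
                sessions.append((visitor, current))
                current = []
            current.append(url)
            last_ts = ts
        sessions.append((visitor, current))
    return sessions

# Made-up log records for illustration.
t0 = datetime(2000, 1, 1, 10, 0)
log = [
    ("1.2.3.4", "Mozilla", t0, "/A"),
    ("1.2.3.4", "Mozilla", t0 + timedelta(minutes=5), "/B"),
    ("1.2.3.4", "Mozilla", t0 + timedelta(minutes=50), "/C"),
    ("1.2.3.4", "Opera", t0 + timedelta(minutes=1), "/A"),
]
sessions = sessionize(log)
```

In this sample the same IP address yields three sessions: two for one agent (split by the 45-minute gap) and one for the other agent.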
16
Using referrer and agent information, 4 sessions are determined
17
Content Preprocessing and Structure Preprocessing
• Content preprocessing
  – Converting text, images, scripts, and other multimedia files into forms that are useful for Web usage mining
  – Classification
    By content
    By intended use (Cooley et al., 1999; Pirolli et al., 1996): convey information, gather information from the user, allow navigation, or a combination
• Structure preprocessing
  – Hyperlinks between page views
18
Phase 2 – Pattern Discovery
Statistical Analysis
• Perform descriptive statistical analysis (such as mean, median, frequency, etc.) on page views, viewing time, and length of a navigational path from the session file
• Web traffic analysis tools produce periodic reports
  – Most frequently accessed pages
  – Average view time of a page
  – Average length of a path through a site
• Useful for improving system performance, enhancing the security of the system, facilitating site modification, and providing support for marketing decisions
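The kind of report described above can be sketched with the standard library. The session data below is made up, and each session is simply a list of page URLs.

```python
from collections import Counter
from statistics import mean, median

# Made-up session file: one list of page views per server session.
sessions = [
    ["/", "/products", "/cart"],
    ["/", "/about"],
    ["/", "/products"],
]

# Frequency of each page across all sessions.
page_counts = Counter(page for session in sessions for page in session)
most_frequent = page_counts.most_common(2)

# Length of the navigational path in each session.
path_lengths = [len(session) for session in sessions]
avg_path_length = mean(path_lengths)
median_path_length = median(path_lengths)
```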
19
Association Rules
• Relate pages that are most often referenced together in a single server session
• Sets of pages that are accessed together with a support value exceeding some specified threshold
• These pages may not be directly connected by hyperlinks
• Useful for Web designers to restructure their Web sites
• These rules serve as a heuristic for prefetching documents in order to reduce user-perceived latency when loading a page from a remote site
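The support computation behind such rules can be sketched for page pairs. This is a simplified illustration restricted to pairs, not a full association-rule miner; the session data and the name `frequent_page_pairs` are assumptions.

```python
from collections import Counter
from itertools import combinations

def frequent_page_pairs(sessions, min_support):
    """Return page pairs whose support (fraction of sessions containing
    both pages) is at least min_support."""
    n = len(sessions)
    pair_counts = Counter()
    for session in sessions:
        # Count each pair once per session, regardless of visit order.
        for pair in combinations(sorted(set(session)), 2):
            pair_counts[pair] += 1
    return {pair: count / n
            for pair, count in pair_counts.items()
            if count / n >= min_support}

# Made-up server sessions.
sessions = [["A", "B", "C"], ["A", "C"], ["B", "D"], ["A", "C", "D"]]
frequent = frequent_page_pairs(sessions, min_support=0.5)
```

Here pages A and C occur together in three of four sessions, whether or not a hyperlink connects them.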
20
Clustering
• Group together a set of items having similar characteristics
• Usage clusters
  – Establish groups of users exhibiting similar browsing patterns
  – Useful for inferring user demographics in order to perform market segmentation
• Page clusters
  – Discover groups of pages that have related content
  – Useful for search engines and Web assistance providers
21
Classification
• Mapping a data item into one of several predefined classes
• Develop a profile of users belonging to a particular class or category
• Requires extraction and selection of features that best describe the properties of a given class or category
• Techniques
  – Decision tree classifiers, naïve Bayesian classifiers, k-nearest neighbor classifiers, support vector machines, etc.
• E.g. 30% of users who place online orders in /Product/Music are in the 19-25 age group and live on the West Coast
22
Sequential Pattern
• Find inter-session patterns
  – The presence of a set of items is followed by another item in a time-ordered set of sessions or episodes
• Useful for predicting future patterns in order to place advertisements for certain user groups
• Temporal analysis
  – Trend analysis, change point detection, or similarity analysis
23
Dependency Modeling
• Develop a model capable of representing significant dependencies among the various variables in the Web domain
• E.g. a model representing the different stages a visitor goes through while shopping in an online store, based on the actions chosen (from casual visitor to serious potential buyer)
• Techniques
  – Hidden Markov models, Bayesian belief networks
24
Phase 3 – Pattern Analysis
Filter out uninteresting rules or patterns from the set found in the pattern discovery phase
25
Major Application Areas for Web Usage Mining (Srivastava et al., 2000)
26
Architecture of the WebSIFT system (Cooley et al., 1999)
27
WUM – Web Usage Miner
Navigation behavior in Web sites (Berendt and Spiliopoulou, 2000)
• A Web site is a network of structurally or semantically interrelated nodes (built in a way that reflects the designers' intuition)
• Quality of a Web site
  – The conformance of the Web site's structure to the intuition of each group of visitors accessing the site
  – The intuition of visitors is indirectly reflected in their navigation behavior (represented in the browsing pattern)
• Measures of the quality of a Web site
  – Quality of service (e.g. response time)
  – Quality of navigation
  – Accessibility
  – Information utility
  – Ease of use
  – Attractiveness of the presentation metaphor
28
Sequence Mining
• Sequence mining supports the discovery of frequent paths composed of not necessarily adjacent pages
• Given a collection of transactions ordered in time (each transaction contains a set of items), discover sequences of maximal length with support above a given threshold
• A sequence is an ordered list of elements, an element being a set of items appearing together in a transaction
• Elements need not be adjacent in time, but their ordering in a sequence must not violate the time ordering of the supporting transactions
• Example
  – Consider a Web site with pages W, A, B, C, D, E, with a link from W to D
  – WABC (1000 times), WDBC (100 times), WABDEC (400 times)
  – Frequency threshold = 25%
  – W followed by D appears 500 (400 + 100) times out of 1500 (= 33%), which is above the threshold
• In the above example, the link from W to D is used in only 1 out of 5 of those cases (100 of 500). Therefore, sequence mining is not useful for understanding the usefulness of a hyperlink.
• In WUM, a navigation pattern is a directed acyclic graph composed of a group of sequences that conform to a template
• The purpose is to determine which links' usage is responsible for the frequency of sequences
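The arithmetic of the example can be checked with a short sketch. It contrasts "W followed by D, not necessarily adjacent" with "D reached directly from W"; the helper `contains_subsequence` is a name chosen for this illustration.

```python
def contains_subsequence(session, pattern):
    """True if the pages of `pattern` occur in `session` in order,
    not necessarily adjacent."""
    remaining = iter(session)
    # `page in remaining` consumes the iterator up to the first match,
    # so later pattern pages must appear after earlier ones.
    return all(page in remaining for page in pattern)

# Paths and their frequencies from the example.
paths = {"WABC": 1000, "WDBC": 100, "WABDEC": 400}
total = sum(paths.values())

# Sessions in which D is visited at some point after W.
wd_any = sum(n for path, n in paths.items()
             if contains_subsequence(path, "WD"))

# Sessions in which D directly follows W, i.e. the W-to-D link is used.
wd_direct = sum(n for path, n in paths.items() if "WD" in path)

support = wd_any / total          # 500 / 1500
link_share = wd_direct / wd_any   # 100 / 500
```

The sequence W…D clears the 25% threshold, yet the hyperlink itself accounts for only one fifth of those sessions.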
29
WUM – Navigation Sequences and Navigation Patterns
• A session is a directed list of page accesses performed by a user during his/her visit to a site
• A navigation pattern is a structure that
  – Emphasizes the common parts among the sessions
  – Does not purge the dissimilar parts
  – Annotates both common and non-common parts with quantitative information
• P is the set of Web pages in the site
  – If the site is dynamic in nature, P is the set of all pages that can be generated
• D is a dataset of sessions
  – A session is a directed list of elements from P
• A sequence of length n is a vector s ∈ (P × N)^n, where N is the set of positive integers
  – U = P × N
• Example
  – P = {a,b,c,d,e,f,g,h}
  – ab, ac, abcde, bcbf, abdfhe are the sessions appearing in D
No.  Session  Sequence                              Appearances
1    ab       (a,1) (b,1)                           40
2    ac       (a,1) (c,1)                           20
3    abcde    (a,1) (b,1) (c,1) (d,1) (e,1)         30
4    bcbf     (b,1) (c,1) (b,2) (f,1)               5
5    abdfhe   (a,1) (b,1) (d,1) (f,1) (h,1) (e,1)   10
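The Sequence column pairs each page with its occurrence number within the session, which is why session bcbf contains (b,2). A minimal sketch of that conversion, with `to_sequence` as a name chosen for this illustration:

```python
from collections import Counter

def to_sequence(session):
    """Turn a session (list of pages) into (page, occurrence) elements."""
    seen = Counter()
    sequence = []
    for page in session:
        seen[page] += 1                      # how often this page has occurred so far
        sequence.append((page, seen[page]))
    return sequence

seq_4 = to_sequence("bcbf")    # session no. 4 in the table
seq_1 = to_sequence("ab")      # session no. 1 in the table
```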
30
Generalized sequences
• A wildcard [low; high] is matched by any sequence of elements that has length at least low and at most high (low ≥ 0, high ≥ low)
• A wildcard without an explicit range is used when its range is not of interest
• A generalized sequence g is a vector g1 g2 … gn
• The number of non-wildcard elements in g is the length of g, length(g)
• Example
  – (a,1) (b,1) [2;4] (e,1) matches Sessions 3 and 5
• The group of sequences that match g constitutes the "navigation pattern of g", navp(g)
• The hits of g, hits(g), is the number of sequences that are matched by g
• confidence(gi, gj, g) = hits(g1 … gi-1 gi) / hits(g1 … gj)
• Example
  – g = (a,1) (b,1) [2;4] (e,1)
  – hits(g) = 30 + 10 = 40
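The hits(g) example can be reproduced with a small matcher. This sketch assumes that a sequence matches g when some contiguous part of it fits the template; the `Wildcard` class and the function names are hypothetical and are not taken from WUM itself.

```python
from typing import NamedTuple

class Wildcard(NamedTuple):
    """Wildcard [low; high]: skips between low and high elements."""
    low: int
    high: int

def _match_from(seq, i, g, k):
    """True if g[k:] matches a run of seq that starts at position i."""
    if k == len(g):
        return True
    item = g[k]
    if isinstance(item, Wildcard):
        # Try every permitted number of skipped elements.
        return any(_match_from(seq, i + skip, g, k + 1)
                   for skip in range(item.low, item.high + 1)
                   if i + skip <= len(seq))
    return i < len(seq) and seq[i] == item and _match_from(seq, i + 1, g, k + 1)

def matches(seq, g):
    return any(_match_from(seq, start, g, 0) for start in range(len(seq) + 1))

def hits(aggregate_log, g):
    """Sum the appearances of all sequences matched by g."""
    return sum(count for seq, count in aggregate_log if matches(seq, g))

# Sessions 1 to 5 with their numbers of appearances.
aggregate_log = [
    ([("a", 1), ("b", 1)], 40),
    ([("a", 1), ("c", 1)], 20),
    ([("a", 1), ("b", 1), ("c", 1), ("d", 1), ("e", 1)], 30),
    ([("b", 1), ("c", 1), ("b", 2), ("f", 1)], 5),
    ([("a", 1), ("b", 1), ("d", 1), ("f", 1), ("h", 1), ("e", 1)], 10),
]
g = [("a", 1), ("b", 1), Wildcard(2, 4), ("e", 1)]
total_hits = hits(aggregate_log, g)
```

Only sessions 3 and 5 have two to four elements between (b,1) and (e,1), so the result agrees with hits(g) = 30 + 10.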
31
Aggregate tree and log
navp(g) is modeled as a tree structure (aggregate tree)
Aggregate log
No.  Session  Sequence                              Appearances
1    ab       (a,1) (b,1)                           40
2    ac       (a,1) (c,1)                           20
3    abcde    (a,1) (b,1) (c,1) (d,1) (e,1)         30
4    bcbf     (b,1) (c,1) (b,2) (f,1)               5
5    abdfhe   (a,1) (b,1) (d,1) (f,1) (h,1) (e,1)   10
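One way to picture an aggregate tree is as a prefix tree over the aggregate log, where sequences sharing a prefix share a branch and each node carries the summed appearances. The sketch below builds such a tree under that assumption; the nested-dict layout and the name `build_aggregate_tree` are illustrative choices, not the WUM data structure itself.

```python
def build_aggregate_tree(aggregate_log):
    """Merge sequences with common prefixes into one tree.

    Each node is a dict with a 'support' count and a 'children' mapping
    from (page, occurrence) elements to child nodes.
    """
    root = {"support": 0, "children": {}}
    for sequence, count in aggregate_log:
        root["support"] += count
        node = root
        for element in sequence:
            child = node["children"].setdefault(
                element, {"support": 0, "children": {}})
            child["support"] += count    # every sequence through this node adds its count
            node = child
    return root

# The aggregate log from the table.
aggregate_log = [
    ([("a", 1), ("b", 1)], 40),
    ([("a", 1), ("c", 1)], 20),
    ([("a", 1), ("b", 1), ("c", 1), ("d", 1), ("e", 1)], 30),
    ([("b", 1), ("c", 1), ("b", 2), ("f", 1)], 5),
    ([("a", 1), ("b", 1), ("d", 1), ("f", 1), ("h", 1), ("e", 1)], 10),
]
tree = build_aggregate_tree(aggregate_log)
node_a = tree["children"][("a", 1)]
node_ab = node_a["children"][("b", 1)]
```

With this data, the node for (a,1) carries 40 + 20 + 30 + 10 appearances and its child (b,1) carries 40 + 30 + 10.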
32
Discover navigation pattern
• A "template" is a vector composed of variables ranging over the domain U and of wildcards
• A mining query is a template declaration accompanied by a conjunction of constraints on the permissible values of the template variables
• Example
  NODE AS x y z
  TEMPLATE x y [2;4] z AS t
  WHERE x.support ≥ 85
  AND (y.support / x.support) ≥ 0.8