Murat Ali Bayır
Middle East Technical University
Department of Computer Engineering
Ankara, Turkey
A New Reactive Method for Processing
Web Usage Data
Murat Ali Bayir, June 06
OUTLINE
Web Mining
Previous Session Reconstruction Heuristics
Smart-SRA
Agent Simulator
Experimental Results
Conclusion
Data & Web Mining
Data Mining: Discovery of useful and interesting patterns from a large dataset.
Web Mining: The application of data mining techniques to discover and retrieve useful information and patterns from World Wide Web documents and services.
Dimensions:
– Web content mining
– Web structure mining
– Web usage mining
Web Mining
Web Usage Mining (WUM): Application of data mining techniques to web log data in order to discover user access patterns.
Example User Web Access Log
IP Address      Request Time               Method  URL     Protocol  Return Code  Bytes Transmitted
144.123.121.23  [25/Apr/2005:03:04:41-05]  GET     A.html  HTTP/1.0  200          3290
144.123.121.23  [25/Apr/2005:03:04:43-05]  GET     B.html  HTTP/1.0  200          2050
144.123.121.23  [25/Apr/2005:03:04:48-05]  GET     C.html  HTTP/1.0  200          4130
The web server log captures the information necessary for WUM.
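As an illustration of how such a log can be read for WUM, here is a minimal Python sketch. The field layout is assumed from the example log on the slide. The names `LogEntry`, `LINE_RE` and `parse_line` are hypothetical, and the time-zone suffix is simply ignored.

```python
import re
from datetime import datetime
from typing import NamedTuple


class LogEntry(NamedTuple):
    ip: str
    time: datetime
    method: str
    url: str
    protocol: str
    status: int
    size: int


# Field layout assumed from the example log on the slide (not a full log-format parser).
LINE_RE = re.compile(
    r"(?P<ip>\S+) \[(?P<time>[^\]]+)\] (?P<method>\S+) (?P<url>\S+) "
    r"(?P<protocol>\S+) (?P<status>\d{3}) (?P<size>\d+)"
)


def parse_line(line: str) -> LogEntry:
    """Parse one log line into a LogEntry; raise ValueError if it does not match."""
    match = LINE_RE.fullmatch(line.strip())
    if match is None:
        raise ValueError(f"unrecognized log line: {line!r}")
    # Keep only "25/Apr/2005:03:04:41" (20 characters); the zone suffix is ignored.
    when = datetime.strptime(match["time"][:20], "%d/%b/%Y:%H:%M:%S")
    return LogEntry(match["ip"], when, match["method"], match["url"],
                    match["protocol"], int(match["status"]), int(match["size"]))


LOG = [
    "144.123.121.23 [25/Apr/2005:03:04:41-05] GET A.html HTTP/1.0 200 3290",
    "144.123.121.23 [25/Apr/2005:03:04:43-05] GET B.html HTTP/1.0 200 2050",
    "144.123.121.23 [25/Apr/2005:03:04:48-05] GET C.html HTTP/1.0 200 4130",
]

entries = [parse_line(line) for line in LOG]

# Group requests per IP address, the usual first step before session reconstruction.
by_user = {}
for entry in entries:
    by_user.setdefault(entry.ip, []).append((entry.url, entry.time))
```

Grouping by IP address alone is a simplification. The slides do not say how users are identified.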
Phases of Web Usage Mining
1. Data Processing
– Includes reconstruction of user sessions by using heuristic techniques. This is the most important phase, since it directly affects the quality of the frequent patterns extracted in the final step.
2. Pattern Discovery
– Includes discovering useful patterns from the reconstructed sessions obtained in the first phase. We have related work on the Pattern Discovery phase [Bayir 06-1].
Previous Session Reconstruction Heuristics
Session Reconstruction
Includes selecting and grouping requests belonging to the same user by using heuristic techniques.
Types:
– Reactive strategies process requests after they are handled by the web server; they process web server logs to obtain sessions. The approach proposed in this thesis is reactive.
– Proactive strategies process requests during the user's interactive browsing of the web site. Session data is gathered during the user's interaction, and the strategy is applied on dynamic server pages.
Session Reconstruction
Proactive strategies require changes to the internal structure of the web site, for example changes to the source code of each dynamic web page.
Reactive strategies require no change and are used for web analytics: customers provide the web logs of their web site, which are then analyzed with these methods. Reactive methods are applicable to all web sites that use the same log format.
Previous Reactive Heuristics
Two types of reactive heuristics have been defined previously:
– Time-oriented heuristics [Spiliopoulou 98, Cooley 99-1]
– Navigation-oriented heuristic [Cooley 99-1, Cooley 99-2]
Smart-SRA [Bayir 06-2] is the new approach proposed in this thesis. It combines these heuristics with web topology information in order to increase the accuracy of the reconstructed sessions.
Previous Reactive Heuristics
The topology of a web site can be represented by a directed web graph.
The topology information can be extracted by using the crawling module of search engine APIs.
Example Web Topology Graph Used for Applying Heuristics
[Figure: directed web graph over pages P1, P13, P20, P23, P34, P49]
Example Web Page Request Sequence
Page       P1  P20  P13  P49  P34  P23
Timestamp   0    6   15   29   32   47
Time-oriented Heuristics -1
Two types of time-oriented heuristics are defined.
In the first, the total duration of a discovered session is limited by a threshold δ1.
Example:
Page       P1  P20  P13  P49  P34  P23
Timestamp   0    6   15   29   32   47
Time threshold (δ1 = 30 mins):
1. [P1, P20, P13, P49] (t(P49) - t(P1) = 29 < 30)
2. [P34, P23] (t(P23) - t(P34) = 15 < 30)
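The first time-oriented heuristic can be sketched in a few lines of Python. This is only an illustration using the slide's request sequence. The function name `split_by_session_duration` is hypothetical, and timestamps are taken to be minutes.

```python
from typing import List, Tuple

Request = Tuple[str, int]  # (page, timestamp in minutes)

REQUESTS: List[Request] = [
    ("P1", 0), ("P20", 6), ("P13", 15), ("P49", 29), ("P34", 32), ("P23", 47),
]


def split_by_session_duration(requests: List[Request],
                              max_duration: int = 30) -> List[List[Request]]:
    """Cut the request sequence so that every session lasts less than max_duration."""
    sessions: List[List[Request]] = []
    for page, ts in requests:
        # Open a new session when the time since the current session's first
        # request reaches the threshold (the slide uses a strict "<" test).
        if not sessions or ts - sessions[-1][0][1] >= max_duration:
            sessions.append([])
        sessions[-1].append((page, ts))
    return sessions


duration_sessions = split_by_session_duration(REQUESTS)
```

On the example sequence this gives the two sessions listed on the slide, [P1, P20, P13, P49] and [P34, P23].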
Time-oriented Heuristics -2
The time spent on any page is limited by a threshold δ2.
That means t(Pn+1) - t(Pn) < δ2.
Example:
Page       P1  P20  P13  P49  P34  P23
Timestamp   0    6   15   29   32   47
Time threshold (δ2 = 10 mins):
1. [P1, P20, P13]
2. [P49, P34]
3. [P23]
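The page-stay variant differs only in what is compared against the threshold. A minimal Python sketch follows, with the hypothetical function name `split_by_page_stay` and timestamps in minutes.

```python
from typing import List, Optional, Tuple

Request = Tuple[str, int]  # (page, timestamp in minutes)

REQUESTS: List[Request] = [
    ("P1", 0), ("P20", 6), ("P13", 15), ("P49", 29), ("P34", 32), ("P23", 47),
]


def split_by_page_stay(requests: List[Request],
                       max_stay: int = 10) -> List[List[Request]]:
    """Cut the request sequence wherever the gap between two requests reaches max_stay."""
    sessions: List[List[Request]] = []
    previous_ts: Optional[int] = None
    for page, ts in requests:
        # A gap of max_stay or more between consecutive requests ends the session.
        if previous_ts is None or ts - previous_ts >= max_stay:
            sessions.append([])
        sessions[-1].append((page, ts))
        previous_ts = ts
    return sessions


page_stay_sessions = split_by_page_stay(REQUESTS)
```

The gaps of 14 minutes (P13 to P49) and 15 minutes (P34 to P23) are the two cut points, which gives the three sessions on the slide.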
Navigation-Oriented Heuristic
In the navigation-oriented heuristic, when processing the user request sequence, there are two cases for adding a new page WPN+1 to a session [WP1, WP2, …, WPN]:
– If WPN has a hyperlink to WPN+1, append the page:
[WP1, WP2, …, WPN, WPN+1]
– If WPN does not have a hyperlink to WPN+1, assume that WPKmax is the nearest page having a hyperlink to WPN+1 and add backward browser moves:
[WP1, WP2, …, WPN, WPN-1, WPN-2, ..., WPKmax, WPN+1]
Navigation-Oriented Heuristic
Example: User request sequence
Current Session                         Condition                                New Page
[ ]                                                                              P1
[P1]                                    Link[P1, P20] = 1                        P20
[P1, P20]                               Link[P20, P13] = 0, Link[P1, P13] = 1    P13
[P1, P20, P1, P13]                      Link[P13, P49] = 1                       P49
[P1, P20, P1, P13, P49]                 Link[P49, P34] = 0, Link[P13, P34] = 1   P34
[P1, P20, P1, P13, P49, P13, P34]       Link[P34, P23] = 1                       P23
[P1, P20, P1, P13, P49, P13, P34, P23]
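The two cases of the navigation-oriented heuristic can be sketched in Python as follows. The link set `LINKS` is an assumption, inferred from the Link[...] conditions and the sessions shown in the worked examples of these slides. Starting a new session when no earlier page links to the request is also an assumption, because the slides do not cover that case.

```python
from typing import Dict, List, Set

# Hyperlinks inferred from the worked examples in the slides (assumption).
LINKS: Dict[str, Set[str]] = {
    "P1": {"P20", "P13"},
    "P13": {"P49", "P34"},
    "P20": {"P23"},
    "P34": {"P23"},
    "P49": {"P23"},
}


def navigation_sessions(pages: List[str],
                        links: Dict[str, Set[str]]) -> List[List[str]]:
    """Rebuild sessions, inserting backward browser moves where needed."""
    sessions: List[List[str]] = []
    current: List[str] = []
    for page in pages:
        if not current:
            current = [page]
            continue
        # Scan backward for the nearest page that links to the new page.
        k = len(current) - 1
        while k >= 0 and page not in links.get(current[k], set()):
            k -= 1
        if k < 0:
            # Assumption: no referrer at all, so a new session starts here.
            sessions.append(current)
            current = [page]
        else:
            # Backward moves revisit positions N-1 down to k (none if k is the last page).
            current.extend(current[k:len(current) - 1][::-1])
            current.append(page)
    if current:
        sessions.append(current)
    return sessions


nav_sessions = navigation_sessions(["P1", "P20", "P13", "P49", "P34", "P23"], LINKS)
```

For the example request sequence this yields the single session [P1, P20, P1, P13, P49, P13, P34, P23] from the last row of the table.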
Smart-SRA
Phase 1: Shorter request sequences are constructed by using the overall session duration time and page-stay time criteria.
Phase 2: Candidate sessions are partitioned into maximal sub-sessions such that between each consecutive page pair in a session there is a hyperlink from the previous page to the next page.
Topology Rule:
– ∀i: 1 ≤ i < n, there is a hyperlink from Pi to Pi+1
Time Rules:
– ∀i: 1 ≤ i < n, Timestamp(Pi) < Timestamp(Pi+1)
– ∀i: 1 ≤ i < n, Timestamp(Pi+1) - Timestamp(Pi) ≤ page stay time threshold
– Timestamp(Pn) - Timestamp(P1) ≤ δ (session duration time)
Smart-SRA
Phase 2 of Smart-SRA processes a candidate session from left to right by repeating the following steps until the candidate session is empty:
1. Determine the web pages without any referrer (on their left) and remove them from the candidate session.
2. For each one of these pages, and for each previously constructed session: if there is a hyperlink from the last page of the session to the web page and the page stay time constraint is satisfied, then append the web page to the session.
3. Remove non-maximal sessions.
Smart-SRA
Example Candidate Session
Page       P1  P20  P13  P49  P34  P23
Timestamp   0    6    9   12   14   15
Example Web Topology Used for Applying Smart-SRA
[Figure: directed web graph over pages P1, P13, P20, P23, P34, P49]
Smart-SRA
Iteration 1 (non-referrers in the set)
  Candidate Session:        [P1, P20, P13, P49, P34, P23]
  New Session Set (before): (empty)
  Temp Page Set:            {P1}
  Temp Session Set:         [P1]
  New Session Set (after):  [P1]
Iteration 2
  Candidate Session:        [P20, P13, P49, P34, P23]
  New Session Set (before): [P1]
  Temp Page Set:            {P20, P13}
  Temp Session Set:         [P1, P20], [P1, P13]
  New Session Set (after):  [P1, P20], [P1, P13]
Iteration 3
  Candidate Session:        [P49, P34, P23]
  New Session Set (before): [P1, P20], [P1, P13]
  Temp Page Set:            {P49, P34}
  Temp Session Set:         [P1, P13, P34], [P1, P13, P49]
  New Session Set (after):  [P1, P13, P34], [P1, P13, P49], [P1, P20]
Iteration 4
  Candidate Session:        [P23]
  New Session Set (before): [P1, P13, P34], [P1, P13, P49], [P1, P20]
  Temp Page Set:            {P23}
  Temp Session Set:         [P1, P13, P34, P23], [P1, P13, P49, P23], [P1, P20, P23]
  New Session Set (after):  [P1, P13, P34, P23], [P1, P13, P49, P23], [P1, P20, P23]
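The iterations above can be reproduced with a short Python sketch of Phase 2. It is an illustration under stated assumptions, not the thesis implementation. The link set is inferred from the worked examples, and the page stay threshold of 10 minutes is borrowed from the experimental setup. Pages that cannot extend any session are dropped here, since the slides do not say what happens to them. All names are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Request = Tuple[str, int]  # (page, timestamp in minutes)

# Hyperlinks inferred from the worked examples in the slides (assumption).
LINKS: Dict[str, Set[str]] = {
    "P1": {"P20", "P13"},
    "P13": {"P49", "P34"},
    "P20": {"P23"},
    "P34": {"P23"},
    "P49": {"P23"},
}


def smart_sra_phase2(candidate: List[Request], links: Dict[str, Set[str]],
                     max_page_stay: int = 10) -> List[List[Request]]:
    """Partition one candidate session into maximal link-connected sessions."""
    remaining = list(candidate)
    sessions: List[List[Request]] = []
    while remaining:
        # Step 1: pages with no referrer among the pages on their left.
        start_pages = [
            (page, ts) for i, (page, ts) in enumerate(remaining)
            if not any(page in links.get(prev, set()) for prev, _ in remaining[:i])
        ]
        remaining = [r for r in remaining if r not in start_pages]
        if not sessions:
            # First iteration: every start page opens its own session.
            sessions = [[r] for r in start_pages]
            continue
        # Step 2: extend each session whose last page links to a start page.
        extended: Set[int] = set()
        new_sessions: List[List[Request]] = []
        for page, ts in start_pages:
            for idx, session in enumerate(sessions):
                last_page, last_ts = session[-1]
                if page in links.get(last_page, set()) and ts - last_ts <= max_page_stay:
                    new_sessions.append(session + [(page, ts)])
                    extended.add(idx)
        # Step 3: an extended session is no longer maximal; keep only the others.
        new_sessions.extend(s for idx, s in enumerate(sessions) if idx not in extended)
        sessions = new_sessions
    return sessions


CANDIDATE: List[Request] = [
    ("P1", 0), ("P20", 6), ("P13", 9), ("P49", 12), ("P34", 14), ("P23", 15),
]
result = [[page for page, _ in s] for s in smart_sra_phase2(CANDIDATE, LINKS)]
```

Marking extended sessions and keeping only the unextended ones is one way to realize step 3. The loop always terminates because the first remaining page never has a referrer on its left.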
Agent Simulator
Models the behavior of web users and generates web user navigation together with the log data kept by the web server.
Used to compare the performance of alternative session reconstruction heuristics.
Agent Simulator
Provides four basic behaviors of a web user:
• A web user can start a session with any one of the possible entry pages of a web site.
• A web user can select as the next page a page having a link from the most recently accessed page.
• A web user can press the back button one or more times and thus select as the next page a page having a link from any one of the previously browsed pages (i.e., pages accessed before the most recently accessed one).
• A web user can terminate his/her session.
Agent Simulator
Behavior I
A web user can start a new session with any one of the possible entry pages of the web site.
[Figure: web topology graph over pages P1, P13, P20, P23, P34, P49 with numbered requests 1 and 2. Legend: start page; new request from server; S1 = Session I; S2 = Session II]
Agent Simulator
Behavior II
A web user can select a new page having a link from the most recently accessed page.
[Figure: web topology graph over pages P1, P13, P20, P23, P34, P49 with numbered requests 1 and 2. Legend: start page; new request from server; S1 = Session I; S2 = Session II]
Agent Simulator
Behavior III
A web user can select as the next page a page having a link from any one of the previously browsed pages.
[Figure: web topology graph over pages P1, P13, P20, P23, P34, P49 with numbered requests 1 to 5. Legend: start page; new request from server; S1 = Session I; S2 = Session II]
Agent Simulator
Behavior IV
A web user can terminate the session.
[Figure: web topology graph over pages P1, P13, P20, P23, P34, P49 with numbered requests 1 to 6. Legend: start page; new request from server; S1 = Session I; S2 = Session II]
The example session is terminated at P23.
Agent Simulator
Three parameters for simulating the behavior of a web user:
– Session Termination Probability (STP)
– Link from Previous pages Probability (LPP)
– New Initial page Probability (NIP)
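A rough Python sketch of one simulated agent follows, to show how the three probabilities could drive the four behaviors. The slides do not specify how the random draws are combined, so this sketch makes several assumptions. The draws are made in the order STP, NIP, LPP. A back move is not logged, as if served from the browser cache. A back move closes the current real session and continues from its prefix. A page without outgoing links ends the simulation. All names are hypothetical.

```python
import random
from typing import Dict, List, Sequence, Set, Tuple


def simulate_agent(links: Dict[str, Set[str]], entry_pages: Sequence[str],
                   stp: float, lpp: float, nip: float, rng: random.Random,
                   max_steps: int = 1000) -> Tuple[List[List[str]], List[str]]:
    """Return (real sessions, server-side request log) for one simulated user."""
    sessions: List[List[str]] = []
    current = [rng.choice(entry_pages)]          # Behavior I: pick an entry page
    log = [current[0]]
    for _ in range(max_steps):
        if rng.random() < stp:                   # Behavior IV: terminate
            break
        if rng.random() < nip:                   # Behavior I again: new initial page
            sessions.append(current)
            current = [rng.choice(entry_pages)]
            log.append(current[0])
            continue
        if len(current) > 1 and rng.random() < lpp:
            back_to = rng.randrange(len(current) - 1)   # Behavior III: go back
        else:
            back_to = len(current) - 1                  # Behavior II: stay on last page
        targets = sorted(links.get(current[back_to], ()))
        if not targets:                          # assumption: a dead end ends the run
            break
        next_page = rng.choice(targets)
        if back_to < len(current) - 1:
            # Assumption: the path so far is one real session; the new one keeps the prefix.
            sessions.append(current)
            current = current[: back_to + 1]
        current = current + [next_page]
        log.append(next_page)                    # only the new request reaches the server
    sessions.append(current)
    return sessions, log


# Hyperlinks inferred from the worked examples in the slides (assumption).
LINKS: Dict[str, Set[str]] = {
    "P1": {"P20", "P13"}, "P13": {"P49", "P34"},
    "P20": {"P23"}, "P34": {"P23"}, "P49": {"P23"},
}
real_sessions, request_log = simulate_agent(
    LINKS, ["P1"], stp=0.05, lpp=0.3, nip=0.3, rng=random.Random(0))
```

The real sessions returned here are what a reconstruction heuristic should recover from the request log alone, which is how the simulator makes accuracy measurable.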
Experimental Results
Heuristics Tested
– Time-oriented heuristic (heur1) (total time 30 min)
– Time-oriented heuristic (heur2) (page stay 10 min)
– Navigation-oriented heuristic (heur3)
– Smart-SRA heuristic (heur4)
Experimental Results
Accuracy is determined as follows:
A reconstructed session H captures a real session R if R occurs as a subsequence of H. A string-matching relation is needed, i.e. the pages of R must appear consecutively in H.
R = [P1, P3, P5]
H = [P9, P1, P3, P5, P8] => H captures R: Yes
H = [P1, P9, P3, P5, P8] => H captures R: No
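The capture relation is a contiguous (string-style) match, which a few lines of Python make concrete. The `real_accuracy` ratio below is an assumption about how the percentage is formed, namely the share of real sessions captured by at least one reconstructed session. The slide text does not preserve the exact formula, and both function names are hypothetical.

```python
from typing import List


def captures(reconstructed: List[str], real: List[str]) -> bool:
    """True if the pages of `real` appear consecutively inside `reconstructed`."""
    n = len(real)
    return any(reconstructed[i:i + n] == real
               for i in range(len(reconstructed) - n + 1))


def real_accuracy(real_sessions: List[List[str]],
                  reconstructed_sessions: List[List[str]]) -> float:
    """Assumed metric: fraction of real sessions captured by some reconstructed session."""
    if not real_sessions:
        return 0.0
    captured = sum(
        1 for r in real_sessions
        if any(captures(h, r) for h in reconstructed_sessions)
    )
    return captured / len(real_sessions)


R = ["P1", "P3", "P5"]
first = captures(["P9", "P1", "P3", "P5", "P8"], R)    # the "Yes" case on the slide
second = captures(["P1", "P9", "P3", "P5", "P8"], R)   # the "No" case on the slide
```

The second case fails because P9 interrupts the run P1, P3, P5, even though all three pages occur in order.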
Parameters for generating user sessions and web topology
Parameter                                  Value
Number of web pages (nodes) in topology    300
Average outdegree                          15
Average page stay time                     2.2 min
Deviation of page stay time                0.5 min
Number of agents                           10000
STP (fixed / range)                        5% / 1%-20%
LPP (fixed / range)                        30% / 0%-90%
NIP (fixed / range)                        30% / 0%-90%
Accuracy vs. STP
[Figure: Real Accuracy (%) vs. STP for heur1, heur2, heur3 and heur4]
Increasing STP leads to sessions with fewer pages, which are easier to predict. In short sessions, the probability that the LPP and NIP behaviors occur is also small.
Accuracy vs. LPP
[Figure: Real Accuracy (%) vs. LPP (0% to 90%) for heur1, heur2, heur3 and heur4]
As LPP increases, the real accuracy decreases. Increasing LPP leads to more complex sessions. Intelligent path completion is needed for discovering more accurate sessions.
Accuracy vs. NIP
[Figure: Real Accuracy (%) vs. NIP (0% to 90%) for heur1, heur2, heur3 and heur4]
Increasing NIP causes more complex sessions, so the accuracy decreases for all heuristics. Path separation is needed for discovering more accurate sessions.
Conclusion
New session reconstruction heuristic: Smart-SRA
– Does not allow sequences with unrelated consecutive requests (no hyperlink from the previous one to the next one).
– Does not insert artificial browser (back) requests in order to prevent unrelated consecutive requests.
– Only maximal sessions are discovered.
The agent simulator simulates the behaviors of real WWW users.
It is possible to evaluate the accuracy of heuristics by using the agent simulator.
Experimental results show that Smart-SRA outperforms previous reactive heuristics.
References
[Bayir 06-1] M. A. Bayir, I. H. Toroslu, A. Cosar (2006). A Performance Comparison of Pattern Discovery Methods on Web Log Data. AICCSA-06, the 4th ACS/IEEE International Conference on Computer Systems and Applications.
[Bayir 06-2] M. A. Bayir, I. H. Toroslu, A. Cosar (2006). A New Approach for Reactive Web Usage Data Processing. ICDE Workshops, 44.
[Cooley 99-1] R. Cooley, B. Mobasher, J. Srivastava (1999). Data Preparation for Mining World Wide Web Browsing Patterns. Knowledge and Information Systems, Vol. 1, No. 1.
[Cooley 99-2] R. Cooley, P. Tan, J. Srivastava (1999). Discovery of Interesting Usage Patterns from Web Data. Advances in Web Usage Analysis and User Profiling, LNAI 1836, Springer, Berlin, Germany, 163-182.
[Spiliopoulou 98] M. Spiliopoulou, L. C. Faulstich (1998). WUM: A Tool for Web Utilization Analysis. Proceedings EDBT Workshop WebDB'98, LNCS 1590, Springer, Berlin, Germany, 184-203.
Thank you for Listening
Any Questions?