
Murat Ali Bayır

Middle East Technical University
Department of Computer Engineering

Ankara, Turkey

A New Reactive Method for Processing Web Usage Data

Murat Ali Bayir, June 06

OUTLINE

Web Mining
Previous Session Reconstruction Heuristics
Smart-SRA
Agent Simulator
Experimental Results
Conclusion

Data & Web Mining

Data mining: the discovery of useful and interesting patterns from a large dataset.

Web mining: the application of data mining techniques to discover and retrieve useful information and patterns from World Wide Web documents and services.

Dimensions:
– Web content mining
– Web structure mining
– Web usage mining

Web Mining

Web Usage Mining (WUM): the application of data mining techniques to web log data in order to discover user access patterns.

Example User Web Access Log

IP Address      Request Time               Method  URL     Protocol  Return Code  Bytes Transmitted
144.123.121.23  [25/Apr/2005:03:04:41–05]  GET     A.html  HTTP/1.0  200          3290
144.123.121.23  [25/Apr/2005:03:04:43–05]  GET     B.html  HTTP/1.0  200          2050
144.123.121.23  [25/Apr/2005:03:04:48–05]  GET     C.html  HTTP/1.0  200          4130

The web access log captures the information necessary for WUM.
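As an illustration of how such log rows could be turned into records for later processing, here is a minimal Python sketch. The field layout is assumed from the three example rows only (IP, bracketed time, method, URL, protocol, return code, byte count); the names `LogEntry`, `LOG_LINE`, and `parse_log_line` are hypothetical and are not part of the presentation.

```python
import re
from typing import NamedTuple, Optional


class LogEntry(NamedTuple):
    ip: str
    request_time: str
    method: str
    url: str
    protocol: str
    status: int
    num_bytes: int


# Assumed layout, read off the example rows:
# IP [time] METHOD URL PROTOCOL STATUS BYTES
LOG_LINE = re.compile(
    r"^(?P<ip>\S+)\s+\[(?P<time>[^\]]+)\]\s+(?P<method>\S+)\s+"
    r"(?P<url>\S+)\s+(?P<protocol>\S+)\s+(?P<status>\d{3})\s+(?P<bytes>\d+)$"
)


def parse_log_line(line: str) -> Optional[LogEntry]:
    """Return a LogEntry for a well-formed line, or None if it does not match."""
    m = LOG_LINE.match(line.strip())
    if m is None:
        return None
    return LogEntry(m["ip"], m["time"], m["method"], m["url"],
                    m["protocol"], int(m["status"]), int(m["bytes"]))


entry = parse_log_line(
    "144.123.121.23 [25/Apr/2005:03:04:41-05] GET A.html HTTP/1.0 200 3290"
)
print(entry)
```

Real server logs usually carry more fields (for example a quoted request line or a referrer), so the pattern would need adjusting to the actual log format in use.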

Web Mining

Phases of Web Usage Mining

1. Data Processing
– Includes reconstruction of user sessions by using heuristic techniques. This is the most important phase, since it directly affects the quality of the frequent patterns extracted in the final step.

2. Pattern Discovery
– Includes discovering useful patterns from the reconstructed sessions obtained in the first phase. Our related work on the pattern discovery phase is [Bayir 06-1].


Previous Session Reconstruction Heuristics

Session Reconstruction

Includes selecting and grouping requests belonging to the same user by using heuristic techniques.

Types:
– Reactive strategies process requests after they are handled by the web server; they process web server logs to obtain sessions. The approach proposed in this thesis is reactive.
– Proactive strategies process requests during the user's interactive browsing of the web site. Session data is gathered during the user's interaction. These strategies are applied on dynamic server pages.

Previous Reactive Heuristics

Session Reconstruction

Proactive strategies require changes to the internal structure of the web site, for example changes to the source code of each dynamic web page.

Reactive strategies require no change and are used for web analytics purposes: customers provide the web logs of their web site, and the logs are analyzed by using these methods. Reactive methods are applicable to all web sites that use the same log format.

Previous Reactive Heuristics

Two types of reactive heuristics have been defined before:
– Time-oriented heuristics [Spiliopoulou 98, Cooley 99-1]
– Navigation-oriented heuristic [Cooley 99-1, Cooley 99-2]

Smart-SRA [Bayir 06-2] is the new approach proposed in this thesis. It combines these heuristics with web topology information in order to increase the accuracy of the reconstructed sessions.

Previous Reactive Heuristics

The topology of a web site can be represented by a directed web graph.

The topology information can be extracted by using the crawling module of search engine APIs.

Example Web Topology Graph Used for Applying Heuristics

[Figure: directed web graph over pages P1, P13, P20, P23, P34, P49]

Example Web Page Request Sequence

Page       P1  P20  P13  P49  P34  P23
Timestamp  0   6    15   29   32   47

Previous Session Reconstruction Heuristics

Time-oriented Heuristics -1

Two types of time-oriented heuristics are defined.

In the first, the total duration of a discovered session is limited by a threshold δ1.

Example:

Page       P1  P20  P13  P49  P34  P23
Timestamp  0   6    15   29   32   47

Time threshold (δ1 = 30 mins):

1. [P1, P20, P13, P49] (t(P49) - t(P1) = 29 < 30)
2. [P34, P23] (t(P23) - t(P34) = 15 < 30)
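The total-duration rule can be sketched in a few lines of Python. This is only an illustration under the assumptions visible in the example (timestamps in minutes, strict comparison against a 30-minute threshold); the function name `split_by_total_duration` and the parameter name `delta1` are hypothetical.

```python
from typing import List, Tuple

Request = Tuple[str, int]  # (page, timestamp in minutes)


def split_by_total_duration(requests: List[Request], delta1: int = 30) -> List[List[str]]:
    """Group time-ordered requests so that each session spans less than delta1."""
    sessions: List[List[str]] = []
    start_time = 0
    for page, ts in requests:
        # A request joins the current session while its distance from the
        # session's first request stays below the threshold.
        if sessions and ts - start_time < delta1:
            sessions[-1].append(page)
        else:
            sessions.append([page])
            start_time = ts
    return sessions


requests = [("P1", 0), ("P20", 6), ("P13", 15), ("P49", 29), ("P34", 32), ("P23", 47)]
print(split_by_total_duration(requests))
# [['P1', 'P20', 'P13', 'P49'], ['P34', 'P23']]
```

On the example request sequence this reproduces the two sessions listed on the slide.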

Previous Session Reconstruction Heuristics

Time-oriented Heuristics -2

The time spent on any page is limited by a threshold δ2.

That means t(P_{n+1}) - t(P_n) < δ2.

Example:

Page       P1  P20  P13  P49  P34  P23
Timestamp  0   6    15   29   32   47

Time threshold (δ2 = 10 mins):

1. [P1, P20, P13]
2. [P49, P34]
3. [P23]
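A comparable sketch for the page-stay rule follows. It assumes timestamps in minutes and the strict comparison shown on the slide; `split_by_page_stay` and `delta2` are hypothetical names chosen for this illustration.

```python
from typing import List, Optional, Tuple

Request = Tuple[str, int]  # (page, timestamp in minutes)


def split_by_page_stay(requests: List[Request], delta2: int = 10) -> List[List[str]]:
    """Start a new session whenever the gap between consecutive requests reaches delta2."""
    sessions: List[List[str]] = []
    prev_ts: Optional[int] = None
    for page, ts in requests:
        if prev_ts is not None and ts - prev_ts < delta2:
            sessions[-1].append(page)
        else:
            sessions.append([page])
        prev_ts = ts
    return sessions


requests = [("P1", 0), ("P20", 6), ("P13", 15), ("P49", 29), ("P34", 32), ("P23", 47)]
print(split_by_page_stay(requests))
# [['P1', 'P20', 'P13'], ['P49', 'P34'], ['P23']]
```

The gaps in the example are 6, 9, 14, 3, and 15 minutes, so the sequence is cut before P49 and before P23.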

Previous Session Reconstruction Heuristics

Navigation-Oriented Heuristic

In the navigation-oriented heuristic, when processing the user request sequence, there are two cases for adding a new page WP_{N+1} to a session [WP_1, WP_2, …, WP_N]:

If WP_N has a hyperlink to WP_{N+1}, the new page is appended:

[WP_1, WP_2, …, WP_N, WP_{N+1}]

If WP_N does not have a hyperlink to WP_{N+1}, assume that WP_{Kmax} is the nearest page having a hyperlink to WP_{N+1}, and add backward browser moves:

[WP_1, WP_2, …, WP_N, WP_{N-1}, WP_{N-2}, ..., WP_{Kmax}, WP_{N+1}]
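The two cases can be sketched in Python as follows. The link set `LINKS` is an assumption: it contains only the hyperlinks that the presentation's worked example reports as present (Link[...] = 1), because the topology figure itself is not recoverable from the transcript. Starting a new session when no earlier page links to the request is also an assumption of this sketch; the slides do not cover that case. The function name `navigation_oriented` is hypothetical.

```python
from typing import Dict, List, Set

# Assumed links, taken from the Link[...] = 1 conditions of the worked example.
LINKS: Dict[str, Set[str]] = {
    "P1": {"P20", "P13"},
    "P13": {"P49", "P34"},
    "P34": {"P23"},
}


def navigation_oriented(pages: List[str], links: Dict[str, Set[str]]) -> List[List[str]]:
    sessions: List[List[str]] = []
    current: List[str] = []
    for page in pages:
        # Scan the session backward for the nearest page that links to `page`.
        k = next((i for i in range(len(current) - 1, -1, -1)
                  if page in links.get(current[i], set())), None)
        if k is None:
            # No referrer in the current session: start a new one (assumption).
            if current:
                sessions.append(current)
            current = [page]
        else:
            # Backward browser moves from the page before the last one down to
            # the referrer; empty when the last page itself links to `page`.
            back_moves = [current[i] for i in range(len(current) - 2, k - 1, -1)]
            current.extend(back_moves)
            current.append(page)
    if current:
        sessions.append(current)
    return sessions


print(navigation_oriented(["P1", "P20", "P13", "P49", "P34", "P23"], LINKS))
# [['P1', 'P20', 'P1', 'P13', 'P49', 'P13', 'P34', 'P23']]
```

The inserted P1 and P13 entries are the artificial back moves that this heuristic adds to keep every consecutive pair connected by a hyperlink.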

Previous Session Reconstruction Heuristics

Navigation-Oriented Heuristic

Example: User request sequence

Current Session                          Condition                                  New Page
[ ]                                                                                 P1
[P1]                                     Link[P1, P20] = 1                          P20
[P1, P20]                                Link[P20, P13] = 0, Link[P1, P13] = 1      P13
[P1, P20, P1, P13]                       Link[P13, P49] = 1                         P49
[P1, P20, P1, P13, P49]                  Link[P49, P34] = 0, Link[P13, P34] = 1     P34
[P1, P20, P1, P13, P49, P13, P34]        Link[P34, P23] = 1                         P23
[P1, P20, P1, P13, P49, P13, P34, P23]


Smart-SRA

Phase 1: Shorter request sequences (candidate sessions) are constructed by using the overall session duration time and page-stay time criteria.

Phase 2: Candidate sessions are partitioned into maximal sub-sessions such that between each consecutive page pair in a session there is a hyperlink from the previous page to the next page.

Topology Rule:
– ∀i: 1 ≤ i < n, there is a hyperlink from P_i to P_{i+1}

Time Rules:
– ∀i: 1 ≤ i < n, Timestamp(P_i) < Timestamp(P_{i+1})
– ∀i: 1 ≤ i < n, Timestamp(P_{i+1}) - Timestamp(P_i) ≤ ρ (page stay time)
– Timestamp(P_n) - Timestamp(P_1) ≤ δ (session duration time)
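Phase 1 can be pictured as applying both time criteria at once. The sketch below is one possible reading, not the thesis implementation: it cuts the time-ordered request stream whenever either the page-stay limit or the session-duration limit would be exceeded. The threshold values (10 and 30 minutes) are borrowed from the time-oriented heuristics and are an assumption here, as are the names `phase1_candidates`, `rho`, and `delta`.

```python
from typing import List, Tuple

Request = Tuple[str, int]  # (page, timestamp in minutes)


def phase1_candidates(requests: List[Request], rho: int = 10, delta: int = 30) -> List[List[Request]]:
    """Split time-ordered requests into candidate sessions using both time criteria."""
    candidates: List[List[Request]] = []
    for page, ts in requests:
        if (candidates
                and ts - candidates[-1][-1][1] <= rho      # page stay time
                and ts - candidates[-1][0][1] <= delta):   # session duration time
            candidates[-1].append((page, ts))
        else:
            candidates.append([(page, ts)])
    return candidates


requests = [("P1", 0), ("P20", 6), ("P13", 15), ("P49", 29), ("P34", 32), ("P23", 47)]
for candidate in phase1_candidates(requests):
    print([page for page, _ in candidate])
# ['P1', 'P20', 'P13']
# ['P49', 'P34']
# ['P23']
```

Timestamps are kept in the output because Phase 2 still needs them to check the page-stay constraint.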

      

Smart-SRA

Phase 2 of Smart-SRA processes a candidate session from left to right by repeating the following steps until the candidate session is empty:

1. Determine the web pages without any referrer (on their left) and remove them from the candidate session.

2. For each one of these pages, and for each previously constructed session: if there is a hyperlink from the last page of the session to the web page and the page stay time constraint is satisfied, then append the web page to the session.

3. Remove non-maximal sessions.
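The three steps can be sketched in Python as below. This is an illustrative reading of the slide, not the thesis code. The link set is an assumption assembled from the links that the presentation's worked examples imply, the page-stay limit `rho = 10` is assumed, and all names (`smart_sra_phase2`, `linked`, `LINKS`) are hypothetical. In this sketch a page that extends no existing session is dropped.

```python
from typing import Dict, List, Set, Tuple

Request = Tuple[str, int]  # (page, timestamp in minutes)

# Assumed topology, inferred from the presentation's worked examples.
LINKS: Dict[str, Set[str]] = {
    "P1": {"P20", "P13"},
    "P13": {"P49", "P34"},
    "P20": {"P23"},
    "P34": {"P23"},
    "P49": {"P23"},
}


def smart_sra_phase2(candidate: List[Request], links: Dict[str, Set[str]],
                     rho: int = 10) -> List[List[str]]:
    def linked(src: Request, dst: Request) -> bool:
        # Hyperlink src -> dst, requested in order, within the page-stay limit.
        return dst[0] in links.get(src[0], set()) and 0 < dst[1] - src[1] <= rho

    remaining = list(candidate)
    sessions: List[List[Request]] = []
    while remaining:
        # Step 1: pages with no referrer among the pages on their left.
        starts = [p for i, p in enumerate(remaining)
                  if not any(linked(q, p) for q in remaining[:i])]
        remaining = [p for p in remaining if p not in starts]
        if not sessions:
            sessions = [[p] for p in starts]
            continue
        # Step 2: append each such page to every session it can extend.
        extended: List[List[Request]] = []
        used = set()
        for p in starts:
            for idx, s in enumerate(sessions):
                if linked(s[-1], p):
                    extended.append(s + [p])
                    used.add(idx)
        # Step 3: sessions that were extended are no longer maximal; drop them.
        sessions = extended + [s for idx, s in enumerate(sessions) if idx not in used]
    return [[page for page, _ in s] for s in sessions]


candidate = [("P1", 0), ("P20", 6), ("P13", 9), ("P49", 12), ("P34", 14), ("P23", 15)]
for session in smart_sra_phase2(candidate, LINKS):
    print(session)
# ['P1', 'P13', 'P49', 'P23']
# ['P1', 'P13', 'P34', 'P23']
# ['P1', 'P20', 'P23']
```

The loop always terminates because the first page of the remaining candidate has nothing on its left and is therefore removed in every iteration.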

Smart-SRA

Example Candidate Session

Page       P1  P20  P13  P49  P34  P23
Timestamp  0   6    9    12   14   15

Example Web Topology Used for Applying Smart-SRA

[Figure: directed web graph over pages P1, P13, P20, P23, P34, P49]

Smart-SRA

Iteration 1
Candidate Session: [P1, P20, P13, P49, P34, P23]
New Session Set (before): (empty)
Temp Page Set (non-referrers in the set): {P1}
Temp Session Set: [P1]
New Session Set (after): [P1]

Iteration 2
Candidate Session: [P20, P13, P49, P34, P23]
New Session Set (before): [P1]
Temp Page Set: {P20, P13}
Temp Session Set: [P1, P20], [P1, P13]
New Session Set (after): [P1, P20], [P1, P13]

Iteration 3
Candidate Session: [P49, P34, P23]
New Session Set (before): [P1, P20], [P1, P13]
Temp Page Set: {P49, P34}
Temp Session Set: [P1, P13, P34], [P1, P13, P49]
New Session Set (after): [P1, P13, P34], [P1, P13, P49], [P1, P20]

Iteration 4
Candidate Session: [P23]
New Session Set (before): [P1, P13, P34], [P1, P13, P49], [P1, P20]
Temp Page Set: {P23}
Temp Session Set: [P1, P13, P34, P23], [P1, P13, P49, P23], [P1, P20, P23]
New Session Set (after): [P1, P13, P34, P23], [P1, P13, P49, P23], [P1, P20, P23]


Agent Simulator

Models the behavior of web users and generates web user navigation together with the log data kept by the web server.

Used to compare the performance of alternative session reconstruction heuristics.

Agent Simulator

Provides 4 basic behaviors of a web user:

• A web user can start a session with any one of the possible entry pages of a web site.

• A web user can select as the next page a page having a link from the most recently accessed page.

• A web user can press the back button one or more times and thus select as the next page a page having a link from any one of the previously browsed pages (i.e., pages accessed before the most recently accessed one).

• A web user can terminate his/her session.

Agent Simulator

Behavior I

Web user can start a new session with any one of the possible entry pages of the web site.

[Figure: example web topology (pages P1, P13, P20, P23, P34, P49) with two sessions, S1 (Session I) and S2 (Session II), each beginning at a start page. Legend: start page; new request from server.]

Agent Simulator

Behavior II

Web user can select a new page having a link from the most recently accessed page.

[Figure: example web topology (pages P1, P13, P20, P23, P34, P49) with requests numbered 1 and 2. Legend: start page; new request from server; S1 Session I; S2 Session II.]

Agent Simulator

Behavior III

Web user can select as the next page a page having a link from any one of the previously browsed pages.

[Figure: example web topology (pages P1, P13, P20, P23, P34, P49) with requests numbered 1 to 5. Legend: start page; new request from server; S1 Session I; S2 Session II.]

Agent Simulator

Behavior IV

Web user can terminate the session.

[Figure: example web topology (pages P1, P13, P20, P23, P34, P49) with requests numbered 1 to 6. Legend: start page; new request from server; S1 Session I; S2 Session II.]

The example session is terminated at P23.

Agent Simulator

3 parameters for simulating the behavior of a web user:
– Session Termination Probability (STP)
– Link from Previous pages Probability (LPP)
– New Initial page Probability (NIP)
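To make the role of the three parameters concrete, here is a small Python sketch of one simulated agent. It is not the simulator from the thesis, and several choices are assumptions made only for illustration: the order in which the probabilities are tested at each step, recording a back-button jump as a new real session built from the prefix of the current path, not logging the back moves themselves, and drawing page-stay times from a normal distribution with mean 2.2 and deviation 0.5 minutes. The names `simulate_agent`, `links`, and `entry_pages` are hypothetical.

```python
import random
from typing import Dict, List, Set, Tuple


def simulate_agent(links: Dict[str, Set[str]], entry_pages: List[str],
                   stp: float = 0.05, lpp: float = 0.30, nip: float = 0.30,
                   seed: int = 0, max_requests: int = 200
                   ) -> Tuple[List[Tuple[str, float]], List[List[str]]]:
    """Return (server log, real sessions) for one simulated web user."""
    rng = random.Random(seed)
    clock = 0.0
    page = rng.choice(entry_pages)              # Behavior I: start at an entry page
    real_sessions: List[List[str]] = [[page]]
    log: List[Tuple[str, float]] = [(page, clock)]
    while len(log) < max_requests:
        if rng.random() < stp:                  # Behavior IV: terminate
            break
        clock += max(0.1, rng.gauss(2.2, 0.5))  # assumed page-stay distribution
        path = real_sessions[-1]
        if rng.random() < nip:                  # jump to a new initial page
            page = rng.choice(entry_pages)
            real_sessions.append([page])
        else:
            # Behavior III with probability lpp (needs an earlier page),
            # otherwise Behavior II from the most recently accessed page.
            back = len(path) > 1 and rng.random() < lpp
            k = rng.randrange(len(path) - 1) if back else len(path) - 1
            targets = sorted(links.get(path[k], ()))
            if not targets:                     # dead end: stop browsing
                break
            page = rng.choice(targets)
            if back:
                real_sessions.append(path[:k + 1] + [page])
            else:
                path.append(page)
        log.append((page, clock))
    return log, real_sessions


links = {"P1": {"P20", "P13"}, "P13": {"P49", "P34"},
         "P20": {"P23"}, "P34": {"P23"}, "P49": {"P23"}}
log, real = simulate_agent(links, ["P1"], seed=1)
print(log)
print(real)
```

Because the simulator knows the real sessions, they can be compared against what each heuristic reconstructs from the generated log.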


Experimental Results

Heuristics Tested
– Time-oriented heuristic (heur1) (total time 30 min)
– Time-oriented heuristic (heur2) (page stay 10 min)
– Navigation-oriented heuristic (heur3)
– Smart-SRA heuristic (heur4)

Experimental Results

Accuracy is determined as follows:

A reconstructed session H captures a real session R if R occurs as a subsequence of H. A string-matching relation is needed, so the pages of R must appear consecutively in H.

R = [P1, P3, P5]
H = [P9, P1, P3, P5, P8] => H captures R: Yes
H = [P1, P9, P3, P5, P8] => H captures R: No
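The capture relation amounts to a contiguous sublist check, which a short Python sketch makes explicit. The function name `captures` is hypothetical; the two calls reuse the sessions from the example.

```python
from typing import List


def captures(h: List[str], r: List[str]) -> bool:
    """True if r occurs in h as a run of consecutive pages (string matching)."""
    n = len(r)
    return any(h[i:i + n] == r for i in range(len(h) - n + 1))


r = ["P1", "P3", "P5"]
print(captures(["P9", "P1", "P3", "P5", "P8"], r))  # True
print(captures(["P1", "P9", "P3", "P5", "P8"], r))  # False
```

The second call fails because P9 interrupts the run, even though P1, P3, and P5 still appear in order.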

Experimental Results

Parameters for generating user sessions and web topology

Parameter                                 Value
Number of web pages (nodes) in topology   300
Average outdegree                         15
Average page stay time                    2.2 min
Deviation of page stay time               0.5 min
Number of agents                          10000

Parameter  Fixed value  Range
STP        5%           1%-20%
LPP        30%          0%-90%
NIP        30%          0%-90%

Experimental Results

Accuracy vs. STP

[Figure: real accuracy (%) versus STP for heur1, heur2, heur3, and heur4]

Increasing STP leads to sessions with fewer pages, which are easier to predict. In short sessions, the probability that the LPP and NIP behaviors occur is also small.

Experimental Results

Accuracy vs. LPP

[Figure: real accuracy (%) versus LPP (0 to 90) for heur1, heur2, heur3, and heur4]

As LPP increases, the real accuracy decreases. Increasing LPP leads to more complex sessions. Intelligent path completion is needed for discovering more accurate sessions.

Experimental Results

Accuracy vs. NIP

[Figure: real accuracy (%) versus NIP (0 to 90) for heur1, heur2, heur3, and heur4]

Increasing NIP causes more complex sessions, so the accuracy decreases for all heuristics. Path separation is needed for discovering more accurate sessions.


Conclusion

New session reconstruction heuristic: Smart-SRA
– Does not allow sequences with unrelated consecutive requests (no hyperlink from the previous page to the next one)
– Does not insert artificial browser (back) requests in order to prevent unrelated consecutive requests
– Discovers only maximal sessions

The agent simulator simulates the behavior of real WWW users.

The accuracy of heuristics can be evaluated by using the agent simulator.

Experimental results show that Smart-SRA outperforms previous reactive heuristics.

References

[Bayir 06-1] M. A. Bayir, I. H. Toroslu, A. Cosar (2006). A Performance Comparison of Pattern Discovery Methods on Web Log Data. AICCSA-06, the 4th ACS/IEEE International Conference on Computer Systems and Applications.

[Bayir 06-2] M. A. Bayir, I. H. Toroslu, A. Cosar (2006). A New Approach for Reactive Web Usage Data Processing. ICDE Workshops, 44.

[Cooley 99-1] R. Cooley, B. Mobasher, and J. Srivastava (1999). Data Preparation for Mining World Wide Web Browsing Patterns. Knowledge and Information Systems Vol. 1, No. 1.

[Cooley 99-2] R. Cooley, P. Tan and J. Srivastava (1999). Discovery of interesting usage patterns from Web data. Advances in Web Usage Analysis and User Profiling. LNAI 1836, Springer, Berlin, Germany. 163-182.

[Spiliopoulou 98] M. Spiliopoulou, L.C. Faulstich (1998). WUM: A tool for Web Utilization analysis. Proceedings EDBT workshop WebDB’98, LNCS 1590, Springer, Berlin, Germany. 184-203.

Thank you for Listening

Any Questions?