Benefits of InterSite Pre-Processing and Clustering Methods in E-Commerce Domain
Sergiu Chelcea, Alzennyr Da Silva, Yves Lechevallier, Doru Tanasa, Brigitte Trousse
AxIS Research Team, INRIA Sophia Antipolis and Rocquencourt
Motivations
To show, on the clickstream dataset proposed for the ECML/PKDD 2005 Discovery Challenge:
- the benefits of our InterSite pre-processing method, proposed by Tanasa in his PhD thesis (2005), and
- the benefits of a new crossed clustering method, developed by Lechevallier & Verde and published in (2003, 2004), when applied to Web logs
2 main viewpoints: user and web site load
Plan
1. Intersite Data Pre-Processing
- introduction of the user's intersite visit (« Group of SessionIDs »)
- first statistical intersite analysis
2. Crossed Clustering Approach
- confusion table with classes of time periods and classes of product types
- analysis on the most used shop: shop 4
3. Conclusions
Table 1. Format of page requests
ShopID | Date (Unix timestamp) | IP address | SessionID | Page | Referrer
11 | 1074585663 | 213.151.91.186 | 939dad92c4…84208dca | / |
11 | 1074585670 | 213.151.91.186 | 87ee02ddcff…7655bb9e | /ct/?c=148 | http://www.shop2.cz
Table 2. Number of requests per shop
ShopID | Site name (shop) | #Requests
10 | www.shop1.cz | 509,688
11 | www.shop2.cz | 400,045
12 | www.shop3.cz | 645,724
14 | www.shop4.cz | 1,290,870
15 | www.shop5.cz | 308,367
16 | www.shop6.cz | 298,030
17 | www.shop7.cz | 164,447
Data pre-processing
Initial data:
Table 3. Transformed log lines
Datetime | IP | SessionID | URL | Referrer
2004-01-20 09:01:03 | 213.151.91.186 | 939dad92c4…84208dca | http://www.shop2.cz/ | -
2004-01-20 09:01:10 | 213.151.91.186 | 87ee02ddcff…7655bb9e | http://www.shop2.cz/ct/?c=148 | http://www.shop2.cz/
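The step from the raw request format (Table 1) to the transformed log lines (Table 3) can be sketched as below. This is only an illustration, not the authors' code: the function name `transform` is hypothetical, the site names are copied from Table 2, and the fixed UTC+1 offset is an assumption inferred from the two sample rows (it reproduces their local times).

```python
from datetime import datetime, timedelta, timezone

# Site names per ShopID, copied from Table 2.
SHOP_SITES = {
    10: "www.shop1.cz", 11: "www.shop2.cz", 12: "www.shop3.cz",
    14: "www.shop4.cz", 15: "www.shop5.cz", 16: "www.shop6.cz",
    17: "www.shop7.cz",
}

# Assumption: times are rendered in UTC+1 (no daylight saving handling).
LOCAL_TZ = timezone(timedelta(hours=1))

def transform(shop_id, unix_time, page, referrer=""):
    """Return (datetime string, absolute URL, referrer or '-')."""
    when = datetime.fromtimestamp(unix_time, tz=LOCAL_TZ)
    url = "http://" + SHOP_SITES[shop_id] + page
    return (when.strftime("%Y-%m-%d %H:%M:%S"), url, referrer or "-")

print(transform(11, 1074585663, "/"))
```

The IP address and SessionID columns are carried over unchanged, so they are left out of the sketch.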
Data pre-processing
• Data structuration
A SessionID identifies a single visit on one shop. To move towards the notion of a user's intersite visit, we group the SessionIDs that belong to a single user (same IP) into a « Group of SessionIDs ». To do so, we compare the Referrer of each request with the URLs previously accessed (within a reasonable time window).
522,410 SessionIDs were grouped into 397,629 Groups, equivalent to a 23.88% reduction.
• Data fusion, data cleaning
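The grouping idea described above can be sketched as follows. This is a simplified reading of it, not the InterSite implementation: the function name `group_sessions`, the tuple layout of a request, and the 30-minute default window are all assumptions (the slides only say "a reasonable time window"), and a union-find structure is used here simply as a convenient way to merge linked sessions.

```python
from collections import defaultdict

def group_sessions(requests, window_seconds=1800):
    """Group SessionIDs linked by a referrer, for requests from the same IP.

    requests: iterable of (timestamp, ip, session_id, url, referrer).
    Returns a list of sets of SessionIDs.
    """
    parent = {}

    def find(sid):
        parent.setdefault(sid, sid)
        while parent[sid] != sid:
            parent[sid] = parent[parent[sid]]  # path halving
            sid = parent[sid]
        return sid

    recent = defaultdict(list)  # ip -> [(timestamp, url, session_id), ...]
    for ts, ip, sid, url, referrer in sorted(requests):
        find(sid)  # register the session even if it is never merged
        # Keep only requests from this IP that fall inside the time window.
        history = [h for h in recent[ip] if ts - h[0] <= window_seconds]
        for _, seen_url, seen_sid in history:
            # A referrer equal to a URL seen under another SessionID
            # links the two sessions.
            if seen_sid != sid and referrer == seen_url:
                parent[find(sid)] = find(seen_sid)
        history.append((ts, url, sid))
        recent[ip] = history

    groups = defaultdict(set)
    for sid in parent:
        groups[find(sid)].add(sid)
    return list(groups.values())
```

With such a grouping, the reduction reported above is simply 1 minus the number of groups divided by the number of SessionIDs.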
[Figure: two line charts of visits per hour of the day (0 to 23), one curve per weekday (Monday to Sunday); (a) y-axis "Visits", 0 to 6000; (b) y-axis "Groups", 0 to 300]
Fig. 1. Visits per day and hour: (a) globally, (b) multi-shop
Data pre-processing
• Low number of new visits on Saturdays and Sundays during lunch time
• High number of new visits on Tuesdays and Wednesdays
• Same results for (a) and (b)
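Counts of the kind plotted in Fig. 1 can be obtained with a sketch like the one below. It assumes that each visit (or group) is represented by the Unix timestamp of its first request and that times are read in UTC+1; both the function name `visits_per_weekday_hour` and the time zone are assumptions, not details given in the slides.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

LOCAL_TZ = timezone(timedelta(hours=1))  # assumption: UTC+1
DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]

def visits_per_weekday_hour(first_request_times):
    """Count new visits per (weekday, hour) from Unix timestamps."""
    counts = Counter()
    for ts in first_request_times:
        dt = datetime.fromtimestamp(ts, tz=LOCAL_TZ)
        counts[(DAYS[dt.weekday()], dt.hour)] += 1
    return counts
```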
Crossed Clustering Approach for Time Periods/Product Analysis
Data: selection of the ls pages of shop 4 (the most used shop)
Method developed by Yves Lechevallier & Rosanna Verde (2003, 2004)
[Figure: bar chart of the number of accesses (0 to 1,400,000) per shop (10, 11, 12, 14, 15, 16, 17), broken down by page type: /ct, /ls, /dt, /znacka, /akce, others]
Crossed Clustering Approach for Time Periods/Product Analysis
Relational DB model: a crossed table is easily added
Row: an individual (one weekday, one hour); 7 days × 24 hours = 168 individuals
Column: a multi-categorical variable representing the number of products requested by users in the specific time slice
Crossed Clustering Approach for Time Periods/Product Analysis
Table 4. Quantity of products requested by weekday x hour and registered on shop 4
Weekday x Hour | Product (number of requests)
Monday_0 | Built-in electric hobs (10), Built-in dish washers 60cm (64), Corner single sinks (50), ...
Monday_1 | Free standing combi refrigerators (44), Corner single sinks (50), Built-in hoods (60), ...
… | …
Sunday_22 | Built-in microwave ovens (27), Built-in dish washers 45cm (38), Built-in dish washers 60cm (85), ...
Sunday_23 | Built-in freezers (56), Kitchen taps with shower (45), Garbage disposers (32), ...
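A crossed table with the shape of Table 4 could be built along these lines. The sketch assumes each request has already been resolved to a (Unix timestamp, product type) pair and that times are read in UTC+1; the function name `build_crossed_table` and these input conventions are hypothetical.

```python
from collections import Counter, defaultdict
from datetime import datetime, timedelta, timezone

LOCAL_TZ = timezone(timedelta(hours=1))  # assumption: UTC+1
DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]

def build_crossed_table(requests):
    """Map each 'Weekday_Hour' row label to a Counter of product types.

    requests: iterable of (unix_time, product_type).
    """
    table = defaultdict(Counter)
    for ts, product in requests:
        dt = datetime.fromtimestamp(ts, tz=LOCAL_TZ)
        row = f"{DAYS[dt.weekday()]}_{dt.hour}"
        table[row][product] += 1
    return table
```

Rows that never occur in the data are simply absent, so fewer than 168 keys may come back on a small sample.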
Crossed Clustering Approach for Time Period/Product Analysis
Table 5. Confusion table
Class | Product_1 | Product_2 | Product_3 | Product_4 | Product_5 | Total
Period_1 | 2847 | 5084 | 3284 | 2265 | 2471 | 15951
Period_2 | 11305 | 31492 | 12951 | 1895 | 9610 | 67253
Period_3 | 33107 | 55652 | 36699 | 5345 | 20370 | 151173
Period_4 | 22682 | 46322 | 30200 | 5165 | 27659 | 132028
Period_5 | 9576 | 20477 | 19721 | 2339 | 7551 | 59664
Period_6 | 1783 | 3515 | 2549 | 392 | 11240 | 19479
Period_7 | 15019 | 14297 | 8608 | 1397 | 6014 | 45335
Total | 96319 | 176839 | 114012 | 18798 | 84915 | 490883
Highlighted: 57.7% (Period_6 x Product_5, 11240 out of the Period_6 total of 19479)
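Once every row (weekday x hour) and every product type has been assigned to a class, a confusion table such as Table 5 is obtained by summing the crossed table over the classes. The sketch below shows only this aggregation step; it does not reproduce the crossed clustering method itself, and the function name `confusion_table` and the dictionary-based inputs are assumptions.

```python
from collections import Counter

def confusion_table(crossed, period_class, product_class):
    """Sum a crossed table over row classes and column classes.

    crossed: {row_label: {product_type: count}}
    period_class: {row_label: period class id}
    product_class: {product_type: product class id}
    Returns a Counter keyed by (period class, product class).
    """
    blocks = Counter()
    for row, products in crossed.items():
        for product, n in products.items():
            blocks[(period_class[row], product_class[product])] += n
    return blocks
```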
Crossed Clustering Approach for Time Period/Product Analysis
Example of one surprising result:
the class Product_5 is defined by a single product type, « Free standing combi refrigerators »,
consulted predominantly on Fridays from 17:00 to 20:00 (class Period_6)
57.7% of the requests in this period (11,240 of 19,479 in Table 5) concern this product type
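The 57.7% figure can be checked directly against Table 5. The short computation below uses only numbers from that table; it shows that the figure is the share of Product_5 within the Period_6 row, whereas the share of Period_6 within the Product_5 column is much smaller.

```python
# Row Period_6 of Table 5: requests for Product_1 .. Product_5.
period_6 = [1783, 3515, 2549, 392, 11240]
# Column Product_5 of Table 5: requests in Period_1 .. Period_7.
product_5 = [2471, 9610, 20370, 27659, 7551, 11240, 6014]

row_share = period_6[4] / sum(period_6)    # Product_5 within Period_6
col_share = product_5[5] / sum(product_5)  # Period_6 within Product_5

print(f"{row_share:.1%}")  # 57.7%
print(f"{col_share:.1%}")  # 13.2%
```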
Conclusions
1. Intersite Data Pre-Processing
- structuration into user's intersite visits (« Group of SessionIDs »)
- first statistical intersite analysis
- anomalies and recommendations for the dataset
2. Crossed Clustering Approach
- first application of such a method to time periods of Web logs and in the e-commerce domain
- promising results
Data pre-processing
Inconsistency problems:
- table kategorie: repeated entries, and different entries with the same ID
- for some page types (dt, df), the given parameter actually represents a specific product, not the product description (from the products table)
- extra parameters equivalent to the given ones for some page types: e.g. for the ct page type, id is equivalent to the given c parameter
- missing values (descriptions) in tables: 3 values in the product table and 64 in the category table
- multiple-site SessionIDs: 13 cross-server visits had the same SessionID on the visited sites (up to 4 sites); the SessionID should change on each new site
- multiple-IP SessionIDs: 3,690 visits (SessionIDs) were made from more than one IP (anonymization proxies?)
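The last two anomalies in the list above can be detected with a simple pass over the requests. The sketch below is illustrative only: the function name `find_session_anomalies` and the (shop_id, ip, session_id) tuple layout are assumptions based on the fields of Table 1.

```python
from collections import defaultdict

def find_session_anomalies(requests):
    """Return (multi_site, multi_ip) sets of suspicious SessionIDs.

    requests: iterable of (shop_id, ip, session_id).
    multi_site: SessionIDs seen on more than one shop.
    multi_ip: SessionIDs seen from more than one IP address.
    """
    shops = defaultdict(set)
    ips = defaultdict(set)
    for shop_id, ip, sid in requests:
        shops[sid].add(shop_id)
        ips[sid].add(ip)
    multi_site = {sid for sid, seen in shops.items() if len(seen) > 1}
    multi_ip = {sid for sid, seen in ips.items() if len(seen) > 1}
    return multi_site, multi_ip
```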