Post on 02-Oct-2020
Web Mining/Web Usage Mining
MMIS 2 VU SS 2011 - 707.025
Denis Helic
KMI, TU Graz
Mar 08, 2012
Denis Helic (KMI, TU Graz) Web Mining/Web Usage Mining Mar 08, 2012 1 / 36
Introduction
The Web is the largest data source in the world
Web Mining aims to extract and mine knowledge from the data on the Web
Data → Information (Data in context) → Knowledge (Information in context)
Typically, knowledge resides in the human mind
Automatic extraction to prepare it for humans
Example: Navigational behavior on the Web
Study by Huberman in 1998
Strong Regularities in World Wide Web Surfing
Observing the number of links users follow on a website
Theoretical model confirmed with the log analysis of several large websites
Example: Navigational behavior on the Web
Figure: Number of links followed vs. number of users
Introduction
Web Mining is a multidisciplinary field
Data mining, machine learning, network science
Statistics, information retrieval, multimedia, etc.
Databases, in particular NoSQL databases
Map/Reduce, GraphDB, etc.
Lack of structure, heterogeneity → very challenging task
Opportunities and Challenges
The amount of information is huge and easily accessible
The coverage of information is huge (information on anything)
All types of information exist (structured databases, text, multimedia, ...)
Much of the Web information is semi-structured (HTML)
Much of the Web information is linked
Opportunities and Challenges
A lot of redundancy (copy&paste instead of linking)
A lot of noise (advertisement, copyright notices, navigation panels, ...)
A lot of Web services that provide different responses for different request parameters
The Web is dynamic (information changes → snapshots)
It is a virtual society → not only about data but also about people
Web Mining
Web Mining classification
Web Usage Mining
User access and interaction patterns
Search access and search interaction → search query logs
Navigation and browsing → access logs
We will deal mostly with this topic
Web Mining
Web Structure Mining
Discover knowledge from the link structure
E.g. PageRank
But also HITS algorithm
Discussed in e.g. Web Science or MMIS1
Web Mining
Web Content Mining
Mining, integration and extraction of knowledge from the Web content
E.g. clustering search results according to the content similarity
Sentiment analysis (positive, negative opinions, ...)
Discussed in e.g. Application areas of Knowledge Management
Web Mining
A subcategory that belongs to all other categories
Web Metadata Mining
Extraction of knowledge from the user metadata, e.g. tags
Tags are also content, tags are typically represented as links, and tags are a specific product of interaction with the system
But other types of metadata are possible: e.g. Wikipedia categories
We will deal with extraction of hierarchies from Web metadata
Data Sources
Web Metadata Mining
Datasets of diverse social Web sites
E.g. Wikipedia dumps
Crawls from tagging systems, e.g. delicious or flickr
Typically crawled via APIs offered by those systems
Very large files
Delicious crawl
000001b9e9e5be0c86cac873e42c2c4d
basil3whitehouse http://en.wikipedia.org/wiki/Roomano
1176073200 food cheese
00000c9d3fee7592680fa80646c36fa7
NicoC http://en.wikipedia.org/wiki/Green_Anaconda
1170720000 anacondas animalinfo
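A record like the two above can be split into named fields with a few lines of Python. This is only a sketch: the field names (`post_id`, `user`, `url`, `timestamp`, `tags`) and the whitespace-separated layout are assumptions read off the sample, not a documented format of the crawl.

```python
from datetime import datetime, timezone
from typing import NamedTuple


class Bookmark(NamedTuple):
    post_id: str         # assumed: hash identifying the post
    user: str            # assumed: account name
    url: str
    timestamp: datetime  # assumed: Unix epoch seconds in the raw data
    tags: list


def parse_bookmark(record: str) -> Bookmark:
    """Split one whitespace-separated record into named fields."""
    post_id, user, url, ts, *tags = record.split()
    when = datetime.fromtimestamp(int(ts), tz=timezone.utc)
    return Bookmark(post_id, user, url, when, tags)


sample = ("000001b9e9e5be0c86cac873e42c2c4d basil3whitehouse "
          "http://en.wikipedia.org/wiki/Roomano 1176073200 food cheese")
bookmark = parse_bookmark(sample)
print(bookmark.user, bookmark.timestamp.year, bookmark.tags)
```

Everything after the fourth token is treated as a tag, so records with any number of tags parse the same way.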
Data Sources
Web Usage Mining
Server level collection
Client level collection
Proxy level collection
Very large files and multiple files
Server level collection
2012-03-07 00:14:20,469 |INFO|
/af/AEIOU/Conrad_von_H%C3%B6tzendorf,_Franz_Freiherr|
-|66.249.66.206|
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
2012-03-07 00:14:21,026 |INFO|
/af/Wissenssammlungen/Fossilien/Escharella|
-|62.47.22.30|Mozilla/5.0
(compatible; MSIE 9.0; Windows NT 6.0; Trident/5.0)
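Entries in this style can be taken apart by splitting on the pipe character. The sketch below assumes that each entry is a single line with six pipe-separated fields; the slides do not say what the `-` field means, so it is kept under the placeholder name `unknown`.

```python
from urllib.parse import unquote


def parse_log_entry(entry: str) -> dict:
    """Parse one pipe-delimited entry (assumed layout, see the text above)."""
    # maxsplit=5 keeps any '|' inside the user agent intact.
    stamp, level, path, unknown, ip, agent = entry.split("|", 5)
    return {
        "time": stamp.strip(),
        "level": level.strip(),
        "path": unquote(path.strip()),  # decode percent-escapes such as %C3%B6
        "unknown": unknown.strip(),     # meaning not stated in the slides
        "ip": ip.strip(),
        "agent": agent.strip(),
    }


entry = ("2012-03-07 00:14:20,469 |INFO|"
         "/af/AEIOU/Conrad_von_H%C3%B6tzendorf,_Franz_Freiherr|"
         "-|66.249.66.206|"
         "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)")
parsed = parse_log_entry(entry)
is_crawler = "Googlebot" in parsed["agent"]
print(parsed["path"], parsed["ip"], is_crawler)
```

Checking the agent string for a crawler name is one simple way to separate robot traffic from human clicks before any usage analysis.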
Server level collection
POST methods typically not logged
Cache hits not logged
Tracking of user sessions is difficult
Cookies, query data stored in separate files → integration
Single site but multiple users
Client level collection
Javascript or plugin, extension code
E.g. Google Analytics sending client data from Javascript to Google for a specific site
Search toolbars for collecting search and navigation(!) paths
No problems with caching or sessions
Single or multiple sites but single user
Proxy level collection
Proxy servers in organizations
Multiple users and multiple sites
Users are anonymous
Still possible to track sessions with heuristics
Web Usage Mining Framework
Figure 1: High Level Web Usage Mining Process (figure not reproduced; recoverable labels: Raw Logs and Site Files → Preprocessing → Preprocessed Clickstream Data → Rules, Patterns, and Statistics → "Interesting" Rules, Patterns, and Statistics)
Fields: IP Address, Userid, Time, Method/URL/Protocol, Status, Size, Referrer, Agent
123.456.78.9 - [25/Apr/1998:03:04:41 -0500] "GET A.html HTTP/1.0" 200 3290 - Mozilla/3.04 (Win95, I)
123.456.78.9 - [25/Apr/1998:03:05:34 -0500] "GET B.html HTTP/1.0" 200 2050 A.html Mozilla/3.04 (Win95, I)
123.456.78.9 - [25/Apr/1998:03:05:39 -0500] "GET L.html HTTP/1.0" 200 4130 - Mozilla/3.04 (Win95, I)
123.456.78.9 - [25/Apr/1998:03:06:02 -0500] "GET F.html HTTP/1.0" 200 5896 B.html Mozilla/3.04 (Win95, I)
123.456.78.9 - [25/Apr/1998:03:06:58 -0500] "GET A.html HTTP/1.0" 200 3290 - Mozilla/3.01 (X11, I, IRIX6.2, IP22)
123.456.78.9 - [25/Apr/1998:03:07:42 -0500] "GET B.html HTTP/1.0" 200 2050 A.html Mozilla/3.01 (X11, I, IRIX6.2, IP22)
123.456.78.9 - [25/Apr/1998:03:07:55 -0500] "GET R.html HTTP/1.0" 200 8140 L.html Mozilla/3.04 (Win95, I)
123.456.78.9 - [25/Apr/1998:03:09:50 -0500] "GET C.html HTTP/1.0" 200 1820 A.html Mozilla/3.01 (X11, I, IRIX6.2, IP22)
123.456.78.9 - [25/Apr/1998:03:10:02 -0500] "GET O.html HTTP/1.0" 200 2270 F.html Mozilla/3.04 (Win95, I)
123.456.78.9 - [25/Apr/1998:03:10:45 -0500] "GET J.html HTTP/1.0" 200 9430 C.html Mozilla/3.01 (X11, I, IRIX6.2, IP22)
123.456.78.9 - [25/Apr/1998:03:12:23 -0500] "GET G.html HTTP/1.0" 200 7220 B.html Mozilla/3.04 (Win95, I)
209.456.78.2 - [25/Apr/1998:05:05:22 -0500] "GET A.html HTTP/1.0" 200 3290 - Mozilla/3.04 (Win95, I)
209.456.78.3 - [25/Apr/1998:05:06:03 -0500] "GET D.html HTTP/1.0" 200 1680 A.html Mozilla/3.04 (Win95, I)
Figure 2: Sample Web Server Log
Source: SIGKDD Explorations, Jan 2000, Volume 1, Issue 2, page 15
Web Metadata Mining
Process slightly different
E.g. instead of log data we have a raw dataset
Depending on the task there might be some additional steps
E.g. extracting a hierarchy
After the analysis apply an optimal algorithm for hierarchy extraction
Web Metadata Preprocessing
Preprocessing typically involves removing irrelevant data
Stemming
Grouping and integration of data
Sorting, etc.
Web Metadata Preprocessing
Depending on the file size (up to e.g. 40-50 GB), Unix shell commands are very useful
E.g. awk, sed, sort, uniq, grep, wc, ...
Also perl
E.g. distribution of items: sort -n -r data.txt | uniq -c
Filter lines: grep -v "null"
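When the counts are needed inside a program rather than on the command line, the same two steps (drop unwanted lines, count identical items) can be sketched in Python. The function name `item_distribution` and the in-memory `lines` list are illustrative stand-ins; in practice the lines would be read lazily from a file such as `data.txt`.

```python
from collections import Counter


def item_distribution(lines, skip="null"):
    """Count how often each line occurs, ignoring lines that contain `skip`."""
    counts = Counter()
    for line in lines:
        item = line.strip()
        if item and skip not in item:
            counts[item] += 1
    return counts


# Stand-in for an open file; any iterable of lines works, so large
# files can be streamed without loading them into memory.
lines = ["food\n", "cheese\n", "null\n", "food\n", "animalinfo\n", "food\n"]
distribution = item_distribution(lines)
for item, count in distribution.most_common():
    print(count, item)
```

Only the distinct items are held in memory, which keeps this workable as long as the number of distinct items is moderate.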
Web Metadata Preprocessing
Wikipedia dumps, e.g. link dump 30G
(12,0,'Alain_Badiou'),(12,0,'Albert_Camus')
perl -p -i.bac -e "s/\((.+?),'(.+?)',.+?'\)(,|;)/\1,\2\n/g" test.txt
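As an alternative illustration of this kind of conversion, the sketch below turns tuples of the sample shape into one `id,title` line each. The three-column layout (number, number, quoted title) is an assumption based only on the sample line; real dump tables may have more columns, and the regular expression would need to be adapted.

```python
import re

# One tuple: (number,number,'title'). The title may contain
# backslash-escaped characters such as \' inside the quotes.
TUPLE = re.compile(r"\((\d+),(\d+),'((?:[^'\\]|\\.)*)'\)")


def tuples_to_rows(sql_values: str):
    """Yield 'id,title' strings for every tuple found in the input."""
    for match in TUPLE.finditer(sql_values):
        first, _second, title = match.groups()  # second column is dropped
        yield f"{first},{title}"


line = "(12,0,'Alain_Badiou'),(12,0,'Albert_Camus')"
rows = list(tuples_to_rows(line))
print("\n".join(rows))
```

Because `finditer` scans the line lazily, very long insert statements can be processed one line at a time.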
Web Usage Preprocessing
Difficult because on the server we have only IP address, agent, server click stream
We need to identify users and sessions
Single IP but multiple sessions because of ISP proxies
Multiple IPs but single user using different machines
Multiple agents but single user even from the same machine
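A common first approximation is to treat each distinct (IP address, agent) pair as one user. The sketch below shows that heuristic on made-up requests; the function name and the sample triples are illustrative only.

```python
from collections import defaultdict


def group_by_user(requests):
    """Group requested pages by the (ip, agent) pair, used here as a user key."""
    users = defaultdict(list)
    for ip, agent, page in requests:
        users[(ip, agent)].append(page)
    return dict(users)


# Hypothetical (ip, agent, page) triples.
requests = [
    ("123.456.78.9", "Mozilla/3.04 (Win95, I)", "A.html"),
    ("123.456.78.9", "Mozilla/3.04 (Win95, I)", "B.html"),
    ("123.456.78.9", "Mozilla/3.01 (X11, I, IRIX6.2, IP22)", "A.html"),
    ("209.456.78.2", "Mozilla/3.04 (Win95, I)", "A.html"),
]
users = group_by_user(requests)
print(len(users))
```

The heuristic inherits the problems listed above: one person using two browsers is counted as two users, and two people behind the same proxy with the same browser are merged into one.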
Web Usage Preprocessing
Assuming that each user has been identified (e.g. through cookies/IP-agent/path analysis)
We need to extract sessions
Difficult to know when the user left the site for another site
Session time-out, typically 30 minutes
Problems with client side caching
If session state is managed elsewhere, it is difficult to know what content is served
Web Usage Preprocessing
Session heuristics
Time heuristics
Total time must not exceed 30 minutes
Total time at a single page must not exceed 10 minutes
Path heuristics (href)
A page must be reached from a previous page in the same session
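The two time heuristics can be sketched as a single pass over one user's time-ordered requests. The thresholds come from the bullets above; the function names, the input shape (a list of `(timestamp, page)` pairs), and the sample visits are assumptions made for illustration.

```python
from datetime import datetime, timedelta

MAX_SESSION = timedelta(minutes=30)    # total session time
MAX_PAGE_STAY = timedelta(minutes=10)  # time spent on a single page


def split_sessions(visits):
    """Split one user's time-ordered (timestamp, page) visits into sessions."""
    sessions = []
    current = []
    for stamp, page in visits:
        if current:
            too_long = stamp - current[0][0] > MAX_SESSION
            stayed_too_long = stamp - current[-1][0] > MAX_PAGE_STAY
            if too_long or stayed_too_long:
                sessions.append(current)  # close the running session
                current = []
        current.append((stamp, page))
    if current:
        sessions.append(current)
    return sessions


def at(clock):
    """Helper for the example: a time of day on a fixed, arbitrary date."""
    return datetime.strptime("1998-04-25 " + clock, "%Y-%m-%d %H:%M:%S")


visits = [(at("03:04:41"), "A.html"), (at("03:05:34"), "B.html"),
          (at("03:12:23"), "G.html"), (at("05:05:22"), "A.html"),
          (at("05:06:03"), "D.html")]
sessions = split_sessions(visits)
print([[page for _, page in s] for s in sessions])
```

A path heuristic could be layered on top by also starting a new session whenever the referrer of a request is not a page already seen in the current session.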
Pattern Discovery
Based on insights and algorithms from statistics, data mining, machine learning and pattern recognition
Statistical analysis, association rules, clustering, classification, sequential patterns, ...
Statistical analysis: descriptive statistics
Frequency, mean, median, mode, standard deviation, ...
E.g. access statistics, average time spent on page, etc.
Outlier detection, e.g. invalid URLs
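For small samples the descriptive statistics named above are available directly in Python's standard library. The numbers below are hypothetical time-on-page values in seconds, and the two-standard-deviation outlier rule is an assumption chosen for illustration, not a rule from the lecture.

```python
import statistics

# Hypothetical time-on-page samples in seconds.
seconds_on_page = [12, 45, 30, 45, 600, 20, 38]

summary = {
    "mean": statistics.mean(seconds_on_page),
    "median": statistics.median(seconds_on_page),
    "mode": statistics.mode(seconds_on_page),
    "stdev": statistics.stdev(seconds_on_page),  # sample standard deviation
}

# Simple outlier rule (an assumption): flag values more than two
# standard deviations away from the mean.
outliers = [s for s in seconds_on_page
            if abs(s - summary["mean"]) > 2 * summary["stdev"]]
print(summary, outliers)
```

The single very long stay pulls the mean far above the median, which is why the median is often the more useful summary for time-on-page data.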
Pattern Discovery
Association rules: correlation statistics
Which pages are often visited in the same session
Correlation of visits to two non-linked pages
Improving the site navigation structure
Clustering: grouping similar items together
E.g. usage clusters and page clusters
Improving search results by showing similar pages
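For the association-rule part, the two standard measures are support (the fraction of sessions containing both pages) and confidence (the fraction of sessions containing the first page that also contain the second). The sketch below computes them for page pairs only; the function name, the `min_support` threshold, and the sample sessions are illustrative assumptions, and a full miner such as Apriori would also handle larger itemsets.

```python
from collections import Counter
from itertools import combinations


def pair_rules(sessions, min_support=0.5):
    """Return {(a, b): (support, confidence of a -> b)} for page pairs."""
    n = len(sessions)
    page_count = Counter()
    pair_count = Counter()
    for session in sessions:
        pages = sorted(set(session))  # each page counts once per session
        page_count.update(pages)
        pair_count.update(combinations(pages, 2))
    rules = {}
    for (a, b), together in pair_count.items():
        support = together / n
        if support >= min_support:
            rules[(a, b)] = (support, together / page_count[a])
            rules[(b, a)] = (support, together / page_count[b])
    return rules


sessions = [["A", "B", "F"], ["A", "B"], ["A", "C"], ["B", "F"]]
rules = pair_rules(sessions)
for (a, b), (support, confidence) in sorted(rules.items()):
    print(f"{a} -> {b}: support={support:.2f} confidence={confidence:.2f}")
```

A rule with high confidence between two pages that do not link to each other is a candidate for a new link in the site navigation.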
Pattern Discovery
Classification: labeling pages from a predefined set of labels
User profiling
Classifying users to product groups such as e.g. music, movies, etc.
Sequential patterns: identify time-ordered sequences of visits
Can be used to predict future visit patterns
Identify points of changing directions, etc.
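One of the simplest ways to use time-ordered visits for prediction is to count which page most often follows which. The sketch below does exactly that; it is a first-order transition model rather than a full sequential pattern miner, and the function names and sample sessions are illustrative.

```python
from collections import Counter, defaultdict


def transition_counts(sessions):
    """Count how often page `b` directly follows page `a`."""
    counts = defaultdict(Counter)
    for session in sessions:
        for a, b in zip(session, session[1:]):
            counts[a][b] += 1
    return counts


def predict_next(counts, page):
    """Return the most frequent successor of `page`, or None if unseen."""
    if page not in counts:
        return None
    return counts[page].most_common(1)[0][0]


sessions = [["A", "B", "F"], ["A", "B", "C"], ["A", "C"], ["L", "B", "F"]]
counts = transition_counts(sessions)
print(predict_next(counts, "A"), predict_next(counts, "B"))
```

Pages that never appear as a predecessor, such as the last page of every session, yield `None`, which can itself be read as a typical exit point.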
Pattern Analysis
Visual analytics
Filter out what is not needed
Concentrate on patterns important for the task at hand
E.g. to improve navigation structure
Identify navigation sequences and navigational hubs
E.g. problems in continuing from the hubs
Potential improvements, e.g. more links, hierarchy, more hints about what is behind links, ...
Applications
Personalization
Dynamic recommendations of links, pages, products, ...
E.g. Facebook
You click a couple of times on liberal blogs posted by your liberal friends
The conservative blogs posted by your conservative friends are not shown anymore
Applications
System improvement
Depending on access patterns you might design new caching strategies
Also load balancing, or data distribution
Security: you might recognize malicious access, ...
Applications
Site modification
Redesigning content and structure
Better linking
More usable navigation structures
Removing distractions, etc.
Evaluation of improvements
Further Info
Web usage mining: discovery and applications of usage patterns from Web data: http://dl.acm.org/citation.cfm?id=846183.846188
Book: Web Data Mining, http://www.cs.uic.edu/~liub/WebMiningBook.html
Tutorial: Web Content Mining, http://www.cs.uic.edu/~liub/WebContentMining.html