Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis

Jean-Pierre Norguet

Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis. Jean-Pierre Norguet. Web Communication. Web transaction = request + response. Meta-data in Web logs: request date and time, page reference (URI), referral URI, client machine information.

Transcript of Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis

Page 1: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis

Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis

Jean-Pierre Norguet

Page 2: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis

Web Communication

[Diagram: Visitor, Web site, Log file; labels: request, response, log transaction]

• Web transaction = request + response
• Meta-data in Web logs:
  – Request date and time
  – Page reference (URI)
  – Referral URI
  – Client machine information
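
The meta-data listed above can be pulled out of a single log line with a short parser. This is only a sketch: it assumes the server writes the widely used Combined Log Format, which the slides do not specify, and the sample line and the names `LOG_PATTERN` and `parse_log_line` are made up for illustration.

```python
import re
from datetime import datetime

# Assumption: Combined Log Format. The slides do not name a log format.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<uri>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_log_line(line):
    """Return the transaction meta-data of one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Request date and time, e.g. 10/Oct/2005:13:55:36 +0200
    fields["time"] = datetime.strptime(fields["time"], "%d/%b/%Y:%H:%M:%S %z")
    return fields

# Hypothetical log line, for illustration only.
sample = ('192.0.2.1 - - [10/Oct/2005:13:55:36 +0200] "GET /cgi/search HTTP/1.0" '
          '200 2326 "http://www.ulb.ac.be/" "Mozilla/4.08"')
record = parse_log_line(sample)
print(record["time"].isoformat(), record["uri"], record["referrer"], record["agent"])
```

Note that the page reference recovered here is only the URI; the log line carries nothing about what the page actually contained.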

Page 3: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis

Web Analytics Process

[Diagram: Visitors, Web site, Log files, Web analytics tool, Reports, Web designer; label: update]

Page 4: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis

Web Analytics Tools

• Results
  – Page views
  – Number of visitors
  – Throughput
  – Traffic

• Exploitation
  – Self-promotion
  – Sales planning
  – Technical resizing
  – Structure optimization

Low semantics → low-level decisions

Page 5: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis

Organization Structure

[Diagram: Organization manager, Web site chief editor, three Sub-editors; Web analytics tools]

Page 6: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis

Web Analytics Results

• Low semantics → low intuitiveness
• Too many results

Page 7: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis

Page Reference Ambiguity (1)

Address: http://www.ulb.ac.be/cgi/search

Page 8: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis

Page Reference Ambiguity (2)

Address: http://www.ulb.ac.be/cgi/search

Page 9: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis

Page Volatility

Address: http://www.ulb.ac.be/cgi/search

Page 10: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis

Page Synonymy (1)

Page 11: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis

Page Synonymy (2)

Page 12: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis

Page Polysemy

Page 13: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis

Page Temporality (1)

Page 14: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis

Page Temporality (2)

Page 15: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis

Problems Summary

• Low semantics → low intuitiveness
• Too many results
• Page reference ambiguity
• Page synonymy
• Page polysemy
• Page temporality
• Page volatility

Page 16: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis

Our Solution

• Summarized and conceptual results for:
  – Chief editors
  – Organization managers

• Generic solution, independent of:
  – Web site content
  – Web site language
  – Web site technology

→ Analyze output text content

Page 17: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis

Output Page Collection

• Mining points in the Web environment:
  1. Web logs (+ content journal)
  2. Web server
  3. Network wire
  4. On-screen Web page

[Diagram: Visitor, Browser, Internet, Router, Web server, with the four mining points marked: 1. Web log files, 2. Server monitoring, 3. Network monitoring, 4. Client-side]

Page 18: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis

Lexical Analysis

• Output page mining → Web pages
• Unformatting → text
• Tokenization → terms
• Stopword removal
• Stemming
• Term selection → index terms
• Occurrence counting → audience metrics
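
The steps above can be sketched as a small pipeline using only the Python standard library. The stopword list, the suffix-stripping `stem` function, and the sample page are toy assumptions, not the components used in the presented work; a real setup would likely use a full stopword list and an established stemmer.

```python
import re
from collections import Counter
from html.parser import HTMLParser

STOPWORDS = {"the", "a", "an", "and", "of", "in", "to", "is", "are", "for"}  # toy list

class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping <script> and <style> content."""
    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0
    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1
    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)

def unformat(html):
    """Unformatting: strip markup and keep the visible text."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)

def stem(token):
    """Crude suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def count_terms(html, index_terms):
    """Tokenize, drop stopwords, stem, keep index terms, count occurrences."""
    tokens = re.findall(r"[a-z]+", unformat(html).lower())
    stems = [stem(t) for t in tokens if t not in STOPWORDS]
    return Counter(s for s in stems if s in index_terms)

page = "<html><body><h1>Databases</h1><p>Mining the database logs.</p></body></html>"
print(count_terms(page, index_terms={"database", "min", "log"}))
```

Because the index terms are compared against stems, they have to be given in stemmed form as well (here `min` for "mining").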

Page 19: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis

Term-Based Metrics

• Term occurrence counting in pages:
  – Presence: online pages
  – Consultation: output pages
  – Interest

Page 20: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis

Term-Based Metrics

• Term-based metrics:
  – Consultation
  – Presence
  – Interest

• Limitations:
  – Too many terms
  – Term synonymy
  – Term polysemy

→ Ontology-based term grouping
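
A minimal sketch of the three term-based metrics follows. It makes two assumptions that the slides do not spell out: interest is taken as the ratio of consultation to presence, and each served page is assumed to carry the same terms as its online version (which holds for static pages only). The function name and the sample data are hypothetical.

```python
from collections import Counter

def term_metrics(online_pages, served_pages):
    """online_pages: page id -> list of index terms found on that page.
    served_pages: list of page ids, one entry per page sent to a visitor."""
    # Presence: term occurrences in the pages that are online.
    presence = Counter()
    for terms in online_pages.values():
        presence.update(terms)
    # Consultation: term occurrences in the pages actually output to visitors.
    consultation = Counter()
    for page_id in served_pages:
        consultation.update(online_pages[page_id])
    # Assumption: interest is the consultation/presence ratio.
    interest = {term: consultation[term] / presence[term] for term in presence}
    return consultation, presence, interest

online = {"/a": ["apple", "apple", "potato"], "/b": ["potato"]}
served = ["/a", "/a", "/b"]
consultation, presence, interest = term_metrics(online, served)
print(consultation, presence, interest)
```

With one metric triple per term, a site with thousands of distinct terms still produces thousands of rows, which is the "too many terms" limitation listed above.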

Page 21: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis

Hierarchical Aggregation

• Consultation
• Presence

[Figure: ontology tree with root Food, children Fruit (Apple, Strawberry) and Vegetable (Carrot, Potato); each node is annotated with consultation and presence values]

Page 22: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis

Hierarchical Aggregation

• Consultation
• Presence
• Interest (x2)

[Figure: ontology tree with root Food, children Fruit (Apple, Strawberry) and Vegetable (Carrot, Potato); each node is annotated with consultation, presence, and interest values]
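
Hierarchical aggregation can be illustrated on the Food tree from the figure: each term's count is added to the term itself and to every ancestor. The counts below are invented for illustration, since the values on the slide did not survive extraction, and the `aggregate` helper is hypothetical.

```python
# Child -> parent links of the example ontology; the root has parent None.
PARENT = {
    "Apple": "Fruit", "Strawberry": "Fruit",
    "Carrot": "Vegetable", "Potato": "Vegetable",
    "Fruit": "Food", "Vegetable": "Food",
    "Food": None,
}

def aggregate(term_counts, parent=PARENT):
    """Add each term's own count to the term and to all of its ancestors."""
    totals = {term: 0 for term in parent}
    for term, count in term_counts.items():
        node = term
        while node is not None:
            totals[node] += count
            node = parent[node]
    return totals

# Illustrative counts only, not the figures from the slide.
consultation = {"Apple": 6, "Strawberry": 4, "Carrot": 2, "Potato": 8, "Fruit": 1}
print(aggregate(consultation))
```

The same function applies unchanged to presence counts, so a concept-level interest could be derived from the two aggregated values.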

Page 24: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis

Data Model

• Ontology term hierarchy
• Number of occurrences: by day, by term
• List of days (possibly aggregated)

[Schema diagram]
DailyTermOccurrences (day : DATETIME, term : VARCHAR, consultation : INT, presence : INT)
OntologyElement (term : VARCHAR, parentTerm : VARCHAR)
Day (day : VARCHAR, label : VARCHAR)
Links: DailyTermOccurrences to Day (day), DailyTermOccurrences to OntologyElement (term), OntologyElement to itself (parentTerm)
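
The three tables can be tried out with an in-memory SQLite database. This is a sketch under assumptions: the keys and foreign-key references are guesses from the diagram links, the rows are invented, and the recursive query is one possible way to sum a concept's subtree, not necessarily how the prototype does it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE OntologyElement (
    term VARCHAR PRIMARY KEY,
    parentTerm VARCHAR REFERENCES OntologyElement(term)
);
CREATE TABLE Day (day VARCHAR PRIMARY KEY, label VARCHAR);
CREATE TABLE DailyTermOccurrences (
    day DATETIME REFERENCES Day(day),
    term VARCHAR REFERENCES OntologyElement(term),
    consultation INT,
    presence INT,
    PRIMARY KEY (day, term)
);
""")
# Invented sample rows, for illustration only.
conn.executemany("INSERT INTO OntologyElement VALUES (?, ?)", [
    ("Food", None), ("Fruit", "Food"), ("Vegetable", "Food"),
    ("Apple", "Fruit"), ("Potato", "Vegetable"),
])
conn.executemany("INSERT INTO Day VALUES (?, ?)", [
    ("2005-10-10", "10 Oct 2005"), ("2005-10-11", "11 Oct 2005"),
])
conn.executemany("INSERT INTO DailyTermOccurrences VALUES (?, ?, ?, ?)", [
    ("2005-10-10", "Apple", 6, 2),
    ("2005-10-10", "Potato", 8, 4),
    ("2005-10-11", "Apple", 3, 2),
])

def rollup(concept):
    """Daily consultation and presence summed over a concept and its descendants."""
    return conn.execute("""
        WITH RECURSIVE subtree(term) AS (
            SELECT ?
            UNION ALL
            SELECT o.term FROM OntologyElement AS o
            JOIN subtree AS s ON o.parentTerm = s.term
        )
        SELECT d.day, SUM(d.consultation), SUM(d.presence)
        FROM DailyTermOccurrences AS d
        JOIN subtree AS s ON d.term = s.term
        GROUP BY d.day ORDER BY d.day
    """, (concept,)).fetchall()

print(rollup("Food"))
```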

Page 25: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis

OLAP Model

• Parent-child ontology dimension
• Time dimension
• Measures

[Cube diagram: term-based metrics (Consultation, Presence, Interest) as measures, with Time and Ontology as dimensions]
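
To make the cube idea concrete, the sketch below rolls daily facts up along both dimensions at once: days to months, and terms to their top-level domain. It is plain Python rather than an OLAP engine, the facts are invented, and treating interest as a calculated measure (summed consultation divided by summed presence) is an assumption.

```python
from collections import defaultdict

PARENT = {
    "Apple": "Fruit", "Strawberry": "Fruit", "Potato": "Vegetable",
    "Fruit": "Food", "Vegetable": "Food", "Food": None,
}

def top_domain(term, parent):
    """Climb the parent-child dimension up to the level just below the root."""
    while parent[term] is not None and parent[parent[term]] is not None:
        term = parent[term]
    return term

def roll_up(facts, parent):
    """facts: (day 'YYYY-MM-DD', term, consultation, presence) tuples.
    Returns (month, top domain) -> (consultation, presence, interest)."""
    cube = defaultdict(lambda: [0, 0])
    for day, term, consultation, presence in facts:
        cell = cube[(day[:7], top_domain(term, parent))]
        cell[0] += consultation
        cell[1] += presence
    # Assumption: interest is a calculated measure, summed consultation / summed presence.
    return {key: (c, p, c / p if p else None) for key, (c, p) in cube.items()}

# Invented facts, for illustration only.
facts = [
    ("2005-10-10", "Apple", 6, 2),
    ("2005-10-11", "Apple", 3, 2),
    ("2005-10-10", "Potato", 8, 4),
    ("2005-11-02", "Strawberry", 4, 1),
]
for cell, measures in sorted(roll_up(facts, PARENT).items()):
    print(cell, measures)
```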

Page 26: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis

Case Study

• Web site: cs.ulb.ac.be
  – 1,500 pages
  – 100 page views/day
  – Knowledge domain: computer science

• Ontology: ACM classification
  – Knowledge domain: computer science
  – 11 top domains
  – 3 levels
  – 1230 terms

Page 27: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis

Experimental Setting

• WASA prototype
• SQL Server OLAP Analysis Service

[Architecture diagram: Visitors; three machines (Web server, stats server, SQL server); components: HTTP Server, Logs, Content Journal, WASA, MySQL, MyODBC, SQL Server, OLAP, Excel]

Page 28: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis

Concept-Based Metrics

• Y: top ontology domains
• X: consultation, presence, interest

Page 29: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis

Results

Page 30: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis

Exploitation Process

[Diagram: actors WASA administrator, chief editor, sub-editors, and organization manager around WASA. Action labels: configures and runs; defines concepts; views reports; redefines writing tasks; updates Web site content; views reports; redefines Web communication objectives; manages organization]

Page 31: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis

Summary

• Web analytics
• Output page mining
• Lexical analysis
• Concept-based metrics with OLAP
• Experiments
• Conclusion & future work

Page 32: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis

Conclusion

• Most Web sites supported
• Approach validated by experiments
• Topic-based metrics are intuitive
• Exploitation at higher decision levels
• Limitation: ontology availability
• Future work:
  – Ontology enrichment
  – Integration into Web analytics tools

Page 33: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis

Thank you for your attention

Page 34: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis

Q & A

Page 35: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis

Content Journaling

• Web logs + content journal
• (+) Easy to set up
• (+) Minimal storage and computation
• (-) Dynamic pages

[Diagram: Visitor, Browser, Internet, Router, Web server; mining point 1. Web log files highlighted]

Page 36: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis

Server Monitoring

• Web server plugin
• (+) Dynamic pages
• (+) Fast
• (-) Risky

[Diagram: Visitor, Browser, Internet, Router, Web server; mining point 2. Server monitoring highlighted]

Page 37: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis

Network Monitoring

• TCP/IP packet sniffing
• (+) Independent of the Web server
• (-) Ethernet only
• (-) Encrypted content
• (-) CPU-intensive

[Diagram: Visitor, Browser, Internet, Router, Web server; mining point 3. Network monitoring highlighted]

Page 38: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis

Client-Side Collection

• Page-embedded program
  1. Parses page
  2. Sends content to mining server

• (+) Distributed workload
• (+) Supports client-side XML/XSL
• (-) Visibility and vulnerability

[Diagram: Visitor, Browser, Internet, Router, Web server; mining point 4. Client-side highlighted]

Page 39: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis

Output Page Collection

• Collection methods alone or in combination → any Web site output is collectable
  1. Implemented: WASA-CJ
  2. Implemented: Sourceforge mod_trace_output

[Diagram: Visitor, Browser, Internet, Router, Web server, with the four mining points marked: 1. Web log files, 2. Server monitoring, 3. Network monitoring, 4. Client-side]

Page 40: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis

Experiments

• Experimental settings
• Visualization
• Ontology coverage
• Validation
• Scalability

Page 41: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis

Experimental Setting

• WASA prototype
• SQL Server OLAP Analysis Service

[Architecture diagram: Visitors; three machines (Web server, stats server, SQL server); components: HTTP Server, Logs, Content Journal, WASA, MySQL, MyODBC, SQL Server, OLAP, Excel]

Page 42: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis

EUROVOC Thesaurus

• European Commission thesaurus
• Knowledge domain: EC-related domains
• 21 top domains
• 8 levels
• 6650 terms

Page 43: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis

Eurovoc Example

• 04 Politics
• 08 International Relations
• 10 European Communities
• 12 Law
• 16 Economics
• 20 Trade
• 24 Finance
• 28 Social Questions
• 32 Education and Communications
• 36 Science
• 40 Business and Competition
• 44 Employment and Working Conditions
• 48 Transport
• 52 Environment
• 56 Agriculture, Forestry and Fisheries
• 60 Agri-Foodstuffs
• 64 Production, Technology and Research
• 66 Energy
• 68 Industry
• 72 Geography
• 76 International Organisations

28 SOCIAL QUESTIONS

• 2806 family
• 2811 migration
• 2816 demography and population
• 2821 social framework
• 2826 social affairs
• 2831 culture and religion
  – arts
  – cultural policy
  – culture
  – acculturation
  – civilization
  – cultural difference
  – cultural identity
    • RT: protection of minorities (1236)
    • RT: socio-cultural group (2821)
  – cultural pluralism
  – popular culture
  – regional culture
  – religion
• 2836 social protection
• 2841 health
• 2846 construction and town planning

Page 44: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis

Ontology Coverage

• Definition: the percentage of ontology terms that appear in the Web site
• ACM classification: 15%
• Eurovoc: 0.75%
• Characterizes the meaning of the metrics → ontology enrichment with terms of the Web site
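
The definition of coverage translates directly into a set intersection. The sketch below assumes that both the ontology terms and the site terms have already been normalized the same way (for example lowercased and stemmed); the function name and the sample terms are hypothetical.

```python
def ontology_coverage(ontology_terms, site_terms):
    """Percentage of ontology terms that appear at least once in the Web site."""
    ontology_terms = set(ontology_terms)
    if not ontology_terms:
        return 0.0
    found = ontology_terms & set(site_terms)
    return 100.0 * len(found) / len(ontology_terms)

# Toy data, for illustration only.
ontology = {"database", "compiler", "network", "logic"}
site = {"database", "network", "teaching"}
print(ontology_coverage(ontology, site))
```

Site terms that are missing from the ontology ("teaching" here) do not affect the coverage value; they are the natural candidates for the enrichment step mentioned above.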

Page 45: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis

Collaborative Enrichment

[Organization diagram, people shown by initials and role]
• EZ: organization manager
• EZ: chief editor
• EZ, MM, EM, SS, PS, JPN, JMD, JTS: sub-editors
• JMD: webmaster
• JPN: WASA administrator

Page 46: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis

Methodology Steps

• Editor browses his pages
• Editor selects new terms
• Editor finds the enrichment point in the ontology
• Editor inserts the terms into the ontology
• Editor sends the ontology to the chief editor
• Chief editor commits the inserts

Page 47: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis

Results

Page 48: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis

Validation

• Comparison with WebTrends
• Personal Web site
• Optimized custom ontology of 1250 terms
• Top concepts match the page directories → results should be comparable

Page 49: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis

Results

[Charts: Urchin, WASA]

Page 50: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis

Scalability: Case Study

• Web site: www.ulb.ac.be
  – 800,000 pages
  – 100,000 page views
  – Knowledge domain: broad

• Ontology: Eurovoc
  – Knowledge domain: broad (EC's interests)
  – 21 top domains
  – 8 levels
  – 6650 terms

• Run = 15 hours, linear dependency → reasonable and applicable to any Web site

Page 51: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis

Experiments

• Experimental settings
• Visualization
• Ontology coverage
• Validation
• Scalability

Page 52: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis

Ontologies

• Specification of a conceptualisation
• Controlled vocabulary of terms and relations
• An ontology defines the concepts and relations that are necessary to share, reuse, and represent the knowledge of a domain
• Example:

[Diagram: concepts Food, Fruit, Vegetable, Apple, Strawberry; one relation labeled "good mix"]

Page 53: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis

Ontology Restriction

• Ontology → concept hierarchy

[Diagram: the example ontology (Food, Fruit, Vegetable, Apple, Strawberry, with a "good mix" relation) shown next to its restriction to the concept hierarchy]

Page 54: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis

Contents

• Context & motivations
• Output page mining
• Lexical analysis
• Concept-based metrics with OLAP
• Experiments
• Exploitation
• Conclusion & future work

Page 55: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis

Context

• Web emergence
• Web communication analysis
• Maintenance needs → effective decisions
• Highest organization levels
• Summarized and conceptual results
• Web analytics tools inappropriate

Page 56: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis

Exploitation Process

[Diagram: actors WASA administrator, chief editor, sub-editors, and organization manager around WASA. Action labels: configures and runs; defines concepts; views reports; redefines writing tasks; updates Web site content; views reports; redefines Web communication objectives; manages organization]

Page 57: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis

Metric Exploitation

• High interest
  – Search pages about the topic
  – Rank pages by consultation
  – Optimize pages

• Low interest
  – Search pages about the topic
  – Rank pages by presence
  – Question the topic: important/not important
  – Drain traffic to the pages/delete pages
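
The "search pages about the topic, then rank" steps can be sketched as follows. This is an illustration under assumptions: a page is considered to be about a topic when it contains at least one of the topic's terms, its per-page consultation is approximated as presence multiplied by page views, and the page data is invented.

```python
def rank_pages(pages, topic_terms, by):
    """pages: page id -> (term counts dict, page views).
    by: 'consultation' (high-interest topics) or 'presence' (low-interest topics)."""
    scores = {}
    for page_id, (term_counts, views) in pages.items():
        presence = sum(term_counts.get(term, 0) for term in topic_terms)
        if presence:  # the page is about the topic
            # Assumption: per-page consultation = presence x page views.
            scores[page_id] = presence * views if by == "consultation" else presence
    # Highest score first; ties broken by page id for a stable order.
    return sorted(scores, key=lambda page_id: (-scores[page_id], page_id))

# Invented pages, for illustration only.
pages = {
    "/fruit": ({"apple": 3}, 10),
    "/recipes": ({"apple": 5, "potato": 2}, 2),
    "/contact": ({}, 50),
}
topic = {"apple", "strawberry"}
print(rank_pages(pages, topic, by="consultation"))
print(rank_pages(pages, topic, by="presence"))
```

The two rankings can differ, as they do here: the page that mentions the topic most is not the one visitors read most, which is the kind of gap the high-interest and low-interest procedures are meant to expose.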

Page 58: Improving Web Sites with Web Usage Mining, Web Content Mining, and Semantic Analysis

Future Work

• Concept visualisation in semantic space
• Automated taxonomy enrichment
• Additional OLAP dimensions